
# Data Analysis Project

**Student Names:** Nate Tyler <br>
**Course Name:** 
<br>
**Date:** 5/2/25 <br>

---

### Project Overview

In this project, you will work in a team of two and use what we have learned in this class to perform an analysis of a dataset.

The data you choose is up to you! I encourage you to select a dataset that fits your area of expertise. Some resources for datasets are listed below. These are just resources I've used in the past; you are not limited to them, but they are a place to start.

- [Kaggle](https://www.kaggle.com/datasets)
- [data.gov](https://data.gov/)
- [Google Dataset Search](https://datasetsearch.research.google.com/)
- [SEC datasets](https://www.sec.gov/data-research/sec-markets-data/financial-statement-data-sets)
- [Dataquest](https://www.dataquest.io/blog/free-datasets-for-projects/)

Your dataset must have at LEAST 10,000 rows of data. Additionally, it should have enough attributes to allow for a meaningful analysis and level of complexity (e.g., a dataset with just two columns for "name" and "birthdate" likely is not going to be enough.)

Once you have selected a dataset of at least 10,000 rows, you will be required to analyze it. The goal is to:
- Import and explore the dataset.
- Clean and prepare the data for analysis.
- Derive at least 6 key insights from the data using Python and Python libraries. You are welcome to add more coding blocks and more insights. 
- Communicate these insights clearly, professionally, and explain how they can inform decision-making.

---

If you and your group want to use GitHub branches to collaborate, here is the [documentation](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-branches) on how to do so.

The cells below are a guide. Feel free to add code or markdown cells as necessary. 

## 1. Importing Necessary Libraries

In [4]:
# Code block to import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statistics as stats



## 2. Loading the Dataset

Load your dataset here. Make sure the dataset has at least 10,000 rows.


In [5]:
# Code block to load the dataset
df = pd.read_csv('HopitalProviderCostReport_2022_FinalProject.csv')
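
The row requirement stated above can be checked right after loading. The following is a minimal sketch, not part of the original notebook: `MIN_ROWS` and `meets_row_requirement` are hypothetical names, and the stand-in frame only mimics the 6,064-row shape reported in Section 3. In the notebook you would pass the loaded `df` instead.

```python
import pandas as pd

MIN_ROWS = 10_000  # threshold stated in the project requirements


def meets_row_requirement(frame: pd.DataFrame, min_rows: int = MIN_ROWS) -> bool:
    """Return True when the frame has at least `min_rows` rows."""
    return len(frame) >= min_rows


# Stand-in frame for illustration; pass the notebook's `df` in practice.
demo = pd.DataFrame({"value": range(6_064)})
print(f"Rows: {len(demo):,} | meets requirement: {meets_row_requirement(demo)}")
```

A frame that falls short of the threshold is worth flagging early, since the grading criteria ask that any exception be cleared with the instructor.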



## 3. Data Exploration

Explore the dataset by checking its dimensions, data types, and any missing values.


In [29]:
# Code block for data exploration
print('Data Set Info:')
#print(df.head()) #Shows 5 rows 

# Show the shape of the dataset; the null counts (how many and where they are located) follow below.
print(f"Initial Dataset Shape: {df.shape}")  # 6,064 rows and 117 columns
df.info()
# Calculate total nulls
total_nulls = df.isnull().sum().sum()

print(f"Total number of null values in the dataset: {total_nulls}")  # 205,738 null values across the whole sheet

#nulls per column 
print("\nNull counts per column:")
print(df.isnull().sum())

# Show what fraction of all cells in the sheet is null (0.29 means about 29%)
null_percent = total_nulls / df.size  # df.size = rows * columns, so the shape is not hard-coded
print(f'Percent of Nulls in data sheet: {null_percent:.2f}')

# Three data types are present in this dataset: floats, integers, and objects


Data Set Info:
Initial Dataset Shape: (6064, 117)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6064 entries, 0 to 6063
Columns: 117 entries, rpt_rec_num to Stand-Alone CHIP Charges
dtypes: float64(103), int64(4), object(10)
memory usage: 5.4+ MB
Total number of null values in the dataset: 205738

Null counts per column:
rpt_rec_num                             0
Provider CCN                            0
Hospital Name                           0
Street Address                          3
City                                    0
                                     ... 
Cost To Charge Ratio                 1464
Net Revenue from Medicaid            1679
Medicaid Charges                     1691
Net Revenue from Stand-Alone CHIP    5078
Stand-Alone CHIP Charges             5058
Length: 117, dtype: int64
Percent of Nulls in data sheet: 0.29



**Guiding Questions:**
- What columns are available in the data?
- Are there any columns with missing values?
- What data types are present?
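
One way to answer the guiding questions in a single table is sketched below. It is an illustration under assumptions, not the notebook's own method: `null_summary` is a hypothetical helper, and the small frame is made up (only its column names are borrowed from the dataset). It lists each column with its data type, null count, and null percentage, worst first.

```python
import numpy as np
import pandas as pd


def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, null count, and null share, worst first."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null_count": frame.isnull().sum(),
        "null_pct": frame.isnull().mean() * 100,
    })
    return summary.sort_values("null_pct", ascending=False)


# Made-up frame for illustration; pass the notebook's `df` in practice.
demo = pd.DataFrame({
    "Hospital Name": ["A", "B", "C", "D"],
    "Net Income": [1.0, np.nan, 3.0, np.nan],
    "Number of Beds": [25, 40, 10, 60],
})
print(null_summary(demo))
```

Sorting by null share makes it easier to decide which columns can be imputed and which are too sparse to rely on.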


***INSIGHTS TO INVESTIGATE*** <br>
1. Total Costs by Rural vs. Urban<br>
2. Net Income by State Code<br>
3. Net Income by CCN Facility Type<br>
4. Total Salaries by State Code<br>
5. Cost of Charity Care by CCN Facility Type
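
Each item in this list pairs a numeric column with a categorical one, so a single grouped summary could serve all five. The sketch below assumes that approach; `summarize_by` is a hypothetical helper and the values are invented for illustration.

```python
import pandas as pd


def summarize_by(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Count, median, mean, and sum of `value_col` per group, highest median first."""
    return (
        frame.groupby(group_col)[value_col]
        .agg(["count", "median", "mean", "sum"])
        .sort_values("median", ascending=False)
    )


# Invented values; in the notebook this would be the cleaned `df`.
demo = pd.DataFrame({
    "State Code": ["TX", "TX", "CA", "CA", "NY"],
    "Net Income": [100.0, 300.0, 50.0, 150.0, 400.0],
})
result = summarize_by(demo, "State Code", "Net Income")
print(result)
```

Reporting the count next to the median helps flag groups that rest on only a handful of hospitals.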


## 4. Data Cleaning/Engineering

Clean the data to ensure consistency and handle any missing values or outliers. Also use this space to add attributes that bolster your analysis. 


In [7]:
# Code block for data cleaning/engineering
#Fix Categorical Data/nulls

categorical_data = [' Rural Versus Urban', 'State Code', 'CCN Facility Type']
for col in categorical_data:
    df[col] = df[col].fillna('Unknown')  # Fill with 'Unknown' because I don't want to drop any columns or rows

# Convert numeric columns to floats first, then replace missing values with the column median
numeric_data = ['Total Costs', 'Net Income', 'Total Salaries From Worksheet A', 'Cost of Charity Care', 'Number of Beds']
for col in numeric_data:
    df[col] = pd.to_numeric(df[col], errors='coerce')  # entries that cannot be parsed become NaN
    median_value = df[col].median()  # call median(); without parentheses this is a method object, not a number
    df[col] = df[col].fillna(median_value)

# Convert categorical columns to string
for col in categorical_data:
    df[col] = df[col].astype(str)
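
The section introduction also mentions outliers, which the cell above does not address. A common rule of thumb is to flag values more than 1.5 interquartile ranges outside the middle 50% of the data. The sketch below assumes that rule; `iqr_outlier_mask` is a hypothetical helper, the numbers are invented, and the rule is one option among several rather than a requirement of the assignment.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies more than k * IQR outside Q1..Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Invented values; in the notebook this could be df['Total Costs'].
demo = pd.Series([10, 12, 11, 13, 12, 11, 500], name="Total Costs")
mask = iqr_outlier_mask(demo)
print(demo[mask])  # rows flagged as outliers
```

Flagging rather than dropping keeps every row available, which matches the choice made above not to drop any columns or rows.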


**Instructions:**
- Describe each cleaning step you take.
- Add any other attributes to improve your analysis. 
- Explain how these changes will improve the analysis.
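
As an example of an added attribute, total costs could be normalized by hospital size so that large and small facilities are comparable. This is only a sketch: the `Cost per Bed` column and the `add_cost_per_bed` helper are hypothetical names that do not exist in the dataset, and the values are invented.

```python
import pandas as pd


def add_cost_per_bed(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 'Cost per Bed' column; zero-bed rows become NaN."""
    out = frame.copy()
    # Mask non-positive bed counts so the division cannot produce infinity.
    beds = out["Number of Beds"].where(out["Number of Beds"] > 0)
    out["Cost per Bed"] = out["Total Costs"] / beds
    return out


# Invented values; in the notebook this would be the cleaned `df`.
demo = pd.DataFrame({
    "Total Costs": [1_000_000.0, 500_000.0, 250_000.0],
    "Number of Beds": [100, 0, 25],
})
result = add_cost_per_bed(demo)
print(result)
```

Working on a copy leaves the original frame unchanged, so the step can be rerun safely.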



## 5. Data Visualization

Use visualization to better understand the data and spot trends. Include at least two visualizations.


In [8]:
# Code block for visualization 1


In [9]:
# Code block for visualization 2


**Guiding Questions:**
- What patterns do you see?
- How can these patterns contribute to further insights?
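
A first visualization for the rural versus urban comparison might look like the sketch below. It uses plain matplotlib on invented values; `plot_median_by_group` is a hypothetical helper, and the group column keeps the leading space exactly as it is written in the cleaning cell.

```python
import matplotlib.pyplot as plt
import pandas as pd

GROUP_COL = " Rural Versus Urban"  # leading space as written in the cleaning cell


def plot_median_by_group(frame: pd.DataFrame, group_col: str, value_col: str):
    """Bar chart of the median of `value_col` per group; returns the Axes."""
    medians = frame.groupby(group_col)[value_col].median().sort_values(ascending=False)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(medians.index.astype(str).tolist(), medians.to_numpy())
    ax.set_xlabel(group_col.strip())
    ax.set_ylabel(f"Median {value_col}")
    ax.set_title(f"Median {value_col} by {group_col.strip()}")
    fig.tight_layout()
    return ax


# Invented values; in the notebook this would be the cleaned `df`.
demo = pd.DataFrame({
    GROUP_COL: ["R", "R", "R", "U", "U", "U"],
    "Total Costs": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})
ax = plot_median_by_group(demo, GROUP_COL, "Total Costs")
```

The median is used instead of the mean because cost data tends to be skewed by a few very large hospitals.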



## 6. Deriving Insights

In this section, derive six insights from the data. Each insight should have:
- A brief description of what you are examining.
- The code used to calculate or visualize it.
- A short analysis of the results and how they contribute to understanding the data.
- Visuals can be used to aid your insights as well.
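
An insight that crosses two categories at once can be computed with a pivot table. The sketch below is one possible shape for such an insight, not a required one: `pivot_median` is a hypothetical helper, and the facility types and amounts are invented.

```python
import pandas as pd

ROW_COL = "CCN Facility Type"
COL_COL = " Rural Versus Urban"  # leading space as written in the cleaning cell


def pivot_median(frame: pd.DataFrame, row_col: str, col_col: str, value_col: str) -> pd.DataFrame:
    """Median of `value_col` for every combination of the two categories."""
    return frame.pivot_table(index=row_col, columns=col_col, values=value_col, aggfunc="median")


# Invented values; in the notebook this would be the cleaned `df`.
demo = pd.DataFrame({
    ROW_COL: ["Type A", "Type A", "Type B", "Type B"],
    COL_COL: ["R", "U", "R", "R"],
    "Net Income": [10.0, 50.0, -5.0, 15.0],
})
table = pivot_median(demo, ROW_COL, COL_COL, "Net Income")
print(table)
```

Combinations with no hospitals show up as NaN, which is itself worth mentioning in the written analysis.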


### Insight 1

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


In [10]:
# Code for Insight


### Insight 2

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


In [11]:
# Code for Insight


### Insight 3

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>

In [12]:
# Code for Insight


### Insight 4

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>

In [13]:
# Code for Insight


### Insight 5

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>

In [14]:
# Code for Insight 


### Insight 6

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


In [15]:
# Code for Insight

If a group of 3, put insights 7-9 below.


## 7. Summary and Conclusions

Summarize your findings and highlight the most impactful insights you derived from the data.

**Guiding Questions:**
- What key takeaways did you learn from this data?
- How can these insights inform future decisions?
- Are there any limitations in your analysis?


**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


## 8. Projections

Finally, tell us something that can be learned from this analysis to make better decisions going forward. 

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


**Instructions for Submission:**

- Submit the completed `.ipynb` file, dataset, and link to dataset in Canvas. (link only is fine if dataset is too big)


## Grading Criteria

Below are some of the criteria that I will be using when grading the final case. Keep in mind, no one part has a specific weight.

| **Section**          | **Description**                                                                                                                                             |
|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Dataset Quality**  | Does the dataset meet the length and complexity requirements (unless previously cleared and discussed with the instructor)? Note that redundant datasets used in prior assignments result in an automatic 15% penalty. |
| **Data Ingestion**   | Demonstrates a well-structured process using Python to import, explore, and clean the data.                                                                 |
| **Data Cleanup**     | Clearly explains how data was cleaned (if necessary) to prepare it for analysis.                                                                            |
| **Analysis**         | - Utilizes a variety of modules and functions to analyze the cleaned dataset.                                                                               |
|                      | - You and your partner must provide at least six (6) insights derived from the analysis. (9 insights if a group of 3)                                                                   |
|                      | - Insights should reveal something not initially obvious from the raw data.                                                                                 |
|                      | - Present each insight with code execution, printed data tables, visuals, etc., and explain the process and significance.                                   |
| **Conclusion**       | Synthesizes your insights to explain the overall story the data tells.                                                                                      |
|                      | Discusses the importance of the data and general conclusions that can be drawn.                                                                             |
| **Projection**       | Takes the analysis further by considering its application in a real-world context.                                                                          |
|                      | Provides recommendations for stakeholders on how the findings can inform better, more informed decisions in the future.                                     |
| **Overall Presentation Quality** | How descriptive your explanations in the file are and the verbal presentation piece. |
