# **Coding Temple's Data Analytics Program**
---
## **Adv Python 2: Data Visualizations Assignment**

## **Part 1:**

### Getting Started
For today's assignment, you will work with a dataset about Kickstarter funding. To download the data, click [here](https://www.kaggle.com/datasets/patkle/most-funded-kickstarter-projects).

### Task 1: Imports
For this assignment, you will need to import pandas and matplotlib. Alias them according to the common naming conventions for both!

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


### Task 2: Load the data
Using pandas, load the data into a dataframe object and view the first 20 rows.

In [None]:
most_funded=pd.read_csv(r"C:\Users\ALIYA\OneDrive\Documents\Coding_Temple\Week 4\Day2\most_funded_feb_2023.csv")
# read_csv already returns a DataFrame, so head(20) shows the first 20 rows directly
most_funded.head(20)


### Task 3: Create Visualizations
Using matplotlib, create visualizations that answer the following questions about the data. Feel free to use whatever plots and graphs you would prefer to use. After you create the visualization, create a markdown cell below it, describing what relationship was plotted and what insights you gleaned. Make this more than a single sentence. 

**Bonus points for the more intricate and detailed visualizations!**

- What is the distribution of the `category_name` column?

In [None]:
category_counts = most_funded['category_name'].value_counts()

plt.figure(figsize=(10, 6))
plt.bar(category_counts.index, category_counts.values)
plt.xlabel('Category Name')
plt.ylabel('Count')
plt.title('Distribution of Category Names')
plt.xticks(rotation=90)
plt.show()

- Which country has the most projects?

In [None]:
country_counts = most_funded['country'].value_counts()
plt.figure(figsize=(10, 6))
plt.bar(country_counts.index, country_counts.values)
plt.xlabel('Country')
plt.ylabel('Project Count')
plt.title('Number of Projects by Country')
plt.xticks(rotation=90)
plt.show()
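
The bar chart ranks the countries visually; to name the top country directly, `value_counts()` can be paired with `idxmax()`. A minimal sketch on a made-up frame (the country codes below are illustrative stand-ins, not values taken from the Kickstarter file):

```python
import pandas as pd

# Hypothetical stand-in for the `country` column of the Kickstarter data
sample = pd.DataFrame({"country": ["US", "GB", "US", "CA", "US", "GB"]})

country_counts = sample["country"].value_counts()
top_country = country_counts.idxmax()    # label with the highest count
top_count = int(country_counts.max())    # number of projects for that label
print(f"{top_country}: {top_count} projects")
```

The same two calls should work on `most_funded['country']`, assuming that column holds one country code per project.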

- What are the top 20 Kickstarter projects?

In [None]:
top_20 = most_funded.nlargest(20, 'pledged')

plt.figure(figsize=(12, 6))
plt.bar(top_20['name'], top_20['pledged'], color='orange')
plt.xlabel('Project Name')
plt.ylabel('Funding Amount')
plt.title('Top 20 Most Funded Kickstarter Projects')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

- Which parent category has the most funded projects?

In [None]:
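# Count by parent category rather than sub-category.
# Assumes the dataset exposes a `category_parent_name` column alongside `category_parent_id`.
parent_category_counts = most_funded['category_parent_name'].value_counts()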
plt.figure(figsize=(10, 6))
plt.bar(parent_category_counts.index, parent_category_counts.values)
plt.xlabel('Parent Category')
plt.ylabel('Project Count')
plt.title('Number of Projects by Parent Category')
plt.xticks(rotation=90)
plt.show()

- Create a box-plot for the `category_parent_id` column

In [None]:
plt.figure(figsize=(10, 6))
# Drop missing IDs first; matplotlib cannot draw a box from data containing NaN
plt.boxplot(most_funded['category_parent_id'].dropna())
plt.xlabel('Category Parent ID')
plt.ylabel('Value')
plt.title('Box Plot for Category Parent ID')
plt.show()

- Create a user-defined function that allows a user to input a category and returns a visualization of your choice.

In [None]:
def category_funding(category_name):
    """Plot the pledged amount of every project in the given category."""
    # Reuse the dataframe loaded in Task 2 and the column names used above
    category_data = most_funded[most_funded['category_name'] == category_name]
    if category_data.empty:
        print(f"No projects found for category '{category_name}'.")
        return

    plt.figure(figsize=(10, 6))
    plt.bar(category_data['name'], category_data['pledged'], color='orange')
    plt.xlabel('Project Name')
    plt.ylabel('Pledged Amount')
    plt.title(f'Pledged Amount for {category_name} Projects')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()


user_input_category = input("Enter a category name: ")
category_funding(user_input_category)


## **Part 2:**

### Getting Started

For part 2, we will be working with the [US Public Food Assistance](https://www.kaggle.com/datasets/jpmiller/publicassistance) dataset. This dataset can be found on Kaggle via the hyperlink provided.

### Task 1: Imports

Import plotly.express. Alias according to industry standard.

In [None]:
import pandas as pd
import plotly.express as px


### Task 2: Load the data
Write a function which does the following:
- Input: 
    - Function should take a filepath (str) object
- Creates a dataframe using pandas
- Lower-case all column headers
- Replace the space with an underscore (_) in all column headers
- Output:
    - Cleaned dataframe object

In [None]:
def cleandata(filepath):
    data = pd.read_csv(filepath)
    data.columns = data.columns.str.lower()
    data.columns = data.columns.str.replace(' ', '_')

    return data

file = cleandata(r"C:\Users\ALIYA\OneDrive\Documents\Coding_Temple\Week 4\Day2\SNAP_history_1969_2019.csv")
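
To check the header cleaning without the real file, the function can be fed an in-memory CSV, since `pd.read_csv` accepts file-like objects as well as paths. The sketch below redefines `cleandata` so that it runs on its own; the two headers and the numbers are toy values chosen only to exercise the lower-casing and underscore replacement.

```python
import io

import pandas as pd


def cleandata(filepath):
    """Load a CSV and normalise its headers to lower_snake_case."""
    data = pd.read_csv(filepath)
    data.columns = data.columns.str.lower().str.replace(" ", "_")
    return data


# Toy CSV text (not real SNAP figures), wrapped in StringIO to mimic a file
csv_text = "Fiscal Year,Total Costs(M)\n2000,10.0\n2001,12.5\n"
cleaned = cleandata(io.StringIO(csv_text))
print(list(cleaned.columns))
```

Chaining `.str.lower().str.replace(...)` does the same work as the two separate assignments; which form to use is a matter of taste.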

### Task 3: Basic EDA
- Using plotly express, create a violin plot of each numerical feature in the data.

In [None]:
numerical_columns = file.select_dtypes(include='number').columns.tolist()

for column in numerical_columns:
    fig = px.violin(file, y=column, box=True, points="all", title=f"Violin Plot of {column}")
    fig.show()

- Create a histogram of the `total_costs(m)` column

In [None]:
fig = px.histogram(file, x='total_costs(m)', title="Histogram of total_costs(m)")
fig.show()

- Create a visualization detailing the ratio of non-null values to null values in the entire dataset

In [None]:
# Share of non-null values in each column (1.0 means nothing is missing)
non_null_ratio = file.notnull().mean()

ratios_df = pd.DataFrame({'Column': non_null_ratio.index, 'Ratio': non_null_ratio.values})
fig = px.bar(ratios_df, x='Column', y='Ratio', title='Share of Non-Null Values per Column')
fig.show()
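
The task asks about the dataset as a whole, so a per-column bar chart can be complemented with a single overall count of null versus non-null cells. A minimal sketch of that calculation on a toy frame (the frame and the names `toy` and `summary` are illustrative only):

```python
import pandas as pd

# Toy frame: column "a" has 1 missing value, column "b" has 2
toy = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", None, None]})

# Sum over columns, then over the resulting Series, to count every cell
null_total = int(toy.isnull().sum().sum())
non_null_total = int(toy.notnull().sum().sum())

summary = pd.DataFrame({
    "kind": ["non-null", "null"],
    "count": [non_null_total, null_total],
})
print(summary)
```

A two-row frame like `summary` could then be passed to `px.pie(summary, names="kind", values="count")` to show the split as a single chart.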

- Create a scatterplot to show the relationship between the total benefits and total costs

In [None]:
fig = px.scatter(file, x='total_costs(m)', y='total_benefits(m)', title='Scatter Plot of Total Benefits vs. Total Costs')
fig.show()

### Task 4: Feature Engineering
Now, let's create a new column, representing the change in total_cost from the previous fiscal year.

After you have the code working in the cell below, try to incorporate it into your function from task 2!

In [None]:
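# Sort chronologically so that diff() compares each year with the one before it
file.sort_values(by='fiscal_year', inplace=True)

# The first year has no predecessor, so its NaN difference is replaced with 0.
# Assigning the result avoids calling fillna(inplace=True) on a column selection.
file['cost_change_from_previous_year'] = file['total_costs(m)'].diff().fillna(0)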
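
One way to fold this step into the loading function from Task 2 is sketched below. The function name `clean_with_cost_change` is hypothetical, the CSV text is toy data rather than real SNAP figures, and resetting the index after sorting is an added design choice, not something the assignment requires.

```python
import io

import pandas as pd


def clean_with_cost_change(filepath):
    """Load, clean headers, and add the year-over-year change in total costs."""
    data = pd.read_csv(filepath)
    data.columns = data.columns.str.lower().str.replace(" ", "_")

    # Sort by year so each row is compared with the previous fiscal year
    data = data.sort_values(by="fiscal_year").reset_index(drop=True)

    # diff() leaves NaN for the first year; treat that as "no change"
    data["cost_change_from_previous_year"] = data["total_costs(m)"].diff().fillna(0)
    return data


# Toy rows, deliberately out of order, to show that sorting happens first
csv_text = "Fiscal Year,Total Costs(M)\n2001,12.5\n2000,10.0\n2002,11.0\n"
result = clean_with_cost_change(io.StringIO(csv_text))
print(result)
```

With the real data, the call would take the CSV path in place of the `StringIO` object.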


### Task 5: Heatmap

Create a heatmap visualization. Return text in the heatmap and move the x axes to the top of the graph. 

In [None]:
# Correlate numeric columns only, so the axis labels match the matrix
corr = file.corr(numeric_only=True)

# text_auto=True writes each correlation value inside its cell
fig = px.imshow(corr, text_auto=True, color_continuous_scale='Viridis')

fig.update_xaxes(side="top")

fig.update_layout(
    title="Heatmap of Correlation",
    xaxis_title="Features",
    yaxis_title="Features",
    autosize=False,
    width=800,
    height=800,
    margin=dict(l=50, r=50, b=100, t=100),
)
fig.show()


### Task 6: Communicate Results
In the markdown cell below, please give an explanation of what the visualization represents as if it were a presentation to someone who had never seen this data before!

**YOUR ANSWER HERE**