## 1. Structure & Guidelines

The most important thing to do is define your problem statement (with your partner). This will be your anchor and will help you choose the dataset. Ideally, this is the problem you will work on for the rest of the project. Since this is a big decision, you may change the problem statement and the dataset in the next assignment, but not after that.

### Where to look for a dataset
There are too many sources for me to name all of them. **Kaggle** is the most popular. To search, you can use Google or, more specifically, **Google Dataset Search**. Many universities also make their datasets available, like the one I use in my example below, and these can be a great resource too.


### EDA Study
Here is some mandatory material to help you get a basic understanding:
- https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/
- https://www.youtube.com/watch?v=9m4n2xVzk9o

The sky is the limit with EDA. Use this as a starting point; I expect you to go beyond it. For instance, I personally love this free book, https://jakevdp.github.io/PythonDataScienceHandbook/, which dives deep into data science with Python. <br>
This book is written entirely in Jupyter notebooks, so it has even more code examples: https://allendowney.github.io/ElementsOfDataScience/


### Working with partners
To reiterate, you will decide the problem statement and the dataset together with your partner(s). I encourage you to work on the assignments together and discuss analytical processes and insights. If you are more experienced/knowledgeable than your partner, please take the lead and help them understand any difficult concepts.

**The idea is to foster collaboration and get support on the path to self-sufficiency.**<br>
This means your assignment submissions, your final analyses, and your dashboard have to be completely your own. You should work on those independently. <br>
For example, discussing a specific assignment task is okay, but copying your partner's answers is not. Try to understand the material from them and write what you know, so that my feedback is valuable to you.

## 2. Assignment Questions/Tasks

1) Discuss & write down a problem statement
2) Find a Dataset(s) that will help you solve your problem
3) EDA Study: Go through the guides I link above and my example to get different perspectives on how to approach EDA
4) Start your EDA by emulating the steps I take below and start forming hypotheses about the dataset and getting insights
5) Use 5 more visualizations or techniques of your choice that I don't use below
6) Write down insights about the dataset and how they relate back to your problem!

## 3. Exploratory Data Analysis

This is the same example from class. I have kept things basic and barebones here so this can serve as a springboard for your analyses. In each step I have added some questions you should ask to get insights into the dataset. Answering these, and other questions you come up with, may require more statistical analysis and visualizations!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo

sns.set(style="whitegrid")

### 1. Data Loading & Quick Overview

In [None]:
# Your dataset here
# Details: https://archive.ics.uci.edu/dataset/2/adult (click the "Import in Python" button there to see this snippet)
adult_income_dataset = fetch_ucirepo(id=2)
df = adult_income_dataset.data.original

In [None]:
# Display first few rows
df.head()

#### Questions to ponder: 

1. Does the data match your expectations or do you think you might need more information?
2. Do the columns/features align with your problem statement?
3. Any immediate signs of missing or corrupted data? 
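
For the third question, a few quick checks can help before you look at anything else. This is only a minimal sketch on a tiny hand-made DataFrame: the name `toy`, its values, and the `"?"` placeholder are assumptions made up for illustration, so swap in your own `df` and whatever placeholder your dataset uses.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "age": [39, 50, 38, 50],
    "workclass": ["State-gov", "?", "Private", "?"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Fully duplicated rows can point to collection or merge issues
n_duplicates = int(toy.duplicated().sum())

# Placeholder strings such as "?" are missing values that isnull() will not catch
placeholder_counts = toy.eq("?").sum()

print(f"Duplicate rows: {n_duplicates}")
print(placeholder_counts)
```

If either number is not zero, note it down now; it will matter when you reach the missing-values step.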


### 2. Shape & Features

In [None]:
# Shape : (rows, columns)
print(f"Dataset shape: {df.shape}")

# Display all column names
print("\nFeature Names:")
print(df.columns.tolist())

#### Questions to ponder: 

1. Is the data large enough for the analysis?
2. Are there any duplicate columns, columns with similar information, or columns that need renaming? (I renamed some columns in my dataset below)
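
One way to check whether two columns carry the same information is to test whether each value in one maps to exactly one value in the other. The sketch below uses made-up toy data; the column names `education` and `education_num` are chosen to resemble the underscore-style names used later in this notebook, and the values are illustrative assumptions.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education_num": [13, 9, 13, 14, 9],
})

# For each label, count how many distinct codes it maps to, and vice versa
codes_per_label = toy.groupby("education")["education_num"].nunique()
labels_per_code = toy.groupby("education_num")["education"].nunique()

# If every count is 1 in both directions, the two columns are redundant
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(f"One-to-one mapping: {is_one_to_one}")
```

When the mapping is one-to-one, you usually only need one of the two columns in later steps.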

In [None]:
#replacing "-" with "_"
df.columns = df.columns.str.replace("-","_")
df.columns

In [None]:
#Get unique target values
df['income'].unique()
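
If `unique()` shows more labels than you expect (for example, the same class written with and without a trailing period or extra spaces), it helps to normalize them before going further. The sketch below assumes that kind of inconsistency and uses a made-up Series; check the actual output of your own target column before applying anything like it.

```python
import pandas as pd

# Toy labels, made up to mimic a target column with inconsistent spelling
income = pd.Series(["<=50K", ">50K.", "<=50K.", ">50K", " <=50K"])

# Strip surrounding whitespace and a trailing period so equal classes compare equal
income_clean = income.str.strip().str.rstrip(".")

print(income_clean.value_counts())
```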

### 3. Data Types & Missing Values

In [None]:
# understanding the datatypes
df.dtypes

In [None]:
# Check missing values
print("\nMissing Values Count:")
print(df.isnull().sum())

#### Questions to ponder: 

- Should we drop or impute missing values?
- Could missing data be an insight in and of itself?
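
To make these two questions concrete, it can help to look at the share of missing values per column and whether missingness differs between groups. This is a minimal sketch on made-up data; the `"?"` placeholder and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "occupation": ["Sales", "?", None, "Tech-support", "?", "Sales"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Treat the "?" placeholder as a real missing value
toy = toy.replace("?", np.nan)

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100

# Does missingness differ between groups? If so, it may carry information
missing_by_income = toy["occupation"].isnull().groupby(toy["income"]).mean()

print(missing_pct.round(1))
print(missing_by_income)
```

A column with only a small share of missing values is often a candidate for dropping rows or simple imputation, while a clear difference between groups suggests keeping a "missing" indicator.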

### 4. Summary Statistics & Outlier Detection

In [None]:
#Summary Stats
df.describe()

#### Questions to ponder
- Did you expect outliers? 
- Which features have unusually high or low values? What do they tell us about the data?
- Are there any suspicious patterns or extreme outliers?
- Do we need to drop or transform these outliers?
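
A common rule of thumb for flagging outliers is the 1.5 × IQR rule, which is also what boxplot whiskers are usually based on. The sketch below applies it to a made-up Series; it is one possible convention, not the only valid definition of an outlier.

```python
import pandas as pd

# Toy values, made up for illustration only
hours = pd.Series([40, 40, 38, 45, 40, 50, 35, 99, 40, 1], name="hours_per_week")

# Interquartile range: the spread of the middle 50% of the data
q1, q3 = hours.quantile([0.25, 0.75])
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = hours[(hours < lower) | (hours > upper)]

print(f"Bounds: [{lower}, {upper}]")
print(outliers)
```

Flagged values are not automatically errors; whether to drop, cap, or transform them depends on your problem statement.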

In [None]:
df["capital_gain"].value_counts().head(20)

### 5. Univariate Analysis

In [None]:
# Define features for visualization (Choosing the numerical features)
num_features = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

# Create a 2x3 grid for visualization
fig, ax = plt.subplots(2, 3, figsize=(18, 12))

# Iterate over features and plot
for i, feature in enumerate(num_features):
    row, col = divmod(i, 3)
    sns.histplot(df[feature], kde=True, bins=30, ax=ax[row, col])
    ax[row, col].set_title(f'Distribution of {feature}')
    
plt.tight_layout()
plt.show()

In [None]:
# Create boxplots for numerical variables
fig, ax = plt.subplots(3, 2, figsize=(18, 15))

for i, feature in enumerate(num_features):
    row, col = divmod(i, 2)
    sns.boxplot(y=df[feature], ax=ax[row, col])
    ax[row, col].set_title(f'Boxplot of {feature}')

plt.tight_layout()
plt.show()


In [None]:
# Create violin plots for numerical variables
fig, ax = plt.subplots(3, 2, figsize=(18, 15))

for i, feature in enumerate(num_features):
    row, col = divmod(i, 2)
    sns.violinplot(y=df[feature], ax=ax[row, col])
    ax[row, col].set_title(f'Violin Plot of {feature}')

plt.tight_layout()
plt.show()


In [None]:
# Define categorical features for visualization
cat_features = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race']

# Create a 2x3 grid for visualization
fig, ax = plt.subplots(2, 3, figsize=(18, 12))

# Iterate over categorical features and plot
for i, feature in enumerate(cat_features):
    row, col = divmod(i, 3)
    sns.countplot(data=df, x=feature, order=df[feature].value_counts().index, ax=ax[row, col])
    ax[row, col].set_title(f'Distribution of {feature}')
    ax[row, col].tick_params(axis='x', rotation=40)  # Rotate x-axis labels for better readability
    
plt.tight_layout()
plt.show()

#### Questions to ponder:

- Are the numerical features skewed or roughly normal?
- Which categories dominate in each categorical feature? What does that tell you about each feature? 
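
The plots give a visual answer to both questions; a numeric summary can back them up. The sketch below uses made-up data and two standard pandas calls, `skew()` for numerical columns and `value_counts(normalize=True)` for categorical ones.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "capital_gain": [0, 0, 0, 0, 0, 0, 0, 5000],
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
    "workclass": ["Private"] * 6 + ["State-gov", "Self-emp"],
})

# Skewness near 0 suggests a symmetric distribution; large positive values mean a long right tail
skewness = toy[["capital_gain", "age"]].skew()

# Share of each category, largest first
shares = toy["workclass"].value_counts(normalize=True)

print(skewness)
print(shares)
```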


### 6. Bivariate Analysis

In [None]:
# Scatter plot for numerical vs. numerical (both axes must be numeric columns)
sns.scatterplot(data=df, x='age', y='hours_per_week')
plt.title("age vs. hours_per_week")
plt.show()

# Grouped bar plot for categorical vs. categorical
sns.countplot(data=df, x='education', hue='marital_status')
plt.title("education by marital_status")
plt.xticks(rotation=45)
plt.show()

# Box plot for numerical vs. categorical
sns.boxplot(data=df, x='education', y='age')
plt.title("Boxplot: age by education")
plt.xticks(rotation=45)
plt.show()

#### Questions to ponder

- Which numerical features are correlated?
- Do certain categories strongly associate with higher or lower numerical values?
- Any visible clusters or patterns in scatter plots?
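
For the second question, grouped summaries and normalized cross-tabulations are a simple complement to the plots. This is a minimal sketch on made-up data; the column names mirror this notebook's, but the values are illustrative assumptions.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors", "Masters", "Masters"],
    "hours_per_week": [40, 36, 45, 40, 50, 48],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Numerical vs. categorical: mean of a numeric column within each category
mean_hours = toy.groupby("education")["hours_per_week"].mean().sort_values()

# Categorical vs. categorical: row-wise shares, so each education level sums to 1
income_share = pd.crosstab(toy["education"], toy["income"], normalize="index")

print(mean_hours)
print(income_share)
```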

### 7. Multivariate Analysis

In [None]:
# Create a correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.select_dtypes(include=['number']).corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()

#### Questions to ponder

- Which features show strong correlation?
- Should we remove or combine highly correlated features?
- Are there surprising correlations that warrant deeper investigation?
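
With many features, reading a heatmap by eye gets hard, so it can help to list the strongly correlated pairs directly. The sketch below uses made-up data, and the 0.8 cutoff is an arbitrary assumption; pick a threshold that makes sense for your problem.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],  # almost a multiple of "a"
    "c": [5, 1, 4, 2, 6, 3],    # roughly unrelated to "a"
})

# Absolute Pearson correlations between all numeric columns
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is masked out
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Comparisons with NaN are False, so masked entries never pass the threshold
threshold = 0.8
strong_pairs = pairs[pairs > threshold]

print(strong_pairs)
```

Pairs that show up here are candidates for removing or combining, but check first whether the relationship makes sense for your problem.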

### 8. Next Steps

- Which features appear most important for the problem?
- What data cleaning or transformation steps remain?
- How will these insights guide the next phase (modeling, reporting, or business decisions)?

## Resources
- Another amazing free book I have used: https://greenteapress.com/thinkstats/thinkstats.pdf
- https://towardsdatascience.com/data-science-101-life-cycle-of-a-data-science-project-86cbc4a2f7f0/