# Name of Project


# Exploratory Data Analysis Report



## Notebook Table of Contents
1. [Project Overview](#project-overview)
2. [Imports](#imports)
3. [Data Loading and Initial Inspection](#data-loading-and-initial-inspection)
4. [Descriptive Statistics](#descriptive-statistics)
5. [Missing Value Analysis](#missing-value-analysis)
6. [Data Cleaning](#data-cleaning)
7. [Univariate Analysis](#univariate-analysis)
8. [Bivariate Analysis](#bivariate-analysis)
9. [Outliers Detection](#outliers-detection)
10. [Feature Engineering (Optional)](#feature-engineering)
11. [Conclusion and Insights](#conclusion-and-insights)

---



### Project Table of Contents

#### Other Notebooks 

Example: [Notebook Name](./notebooks/exploratory/eda.ipynb)

---
#### Linking to Data Files

Example: [Data File](https://s3.amazonaws.com/codecademy-datasets/insurance.csv)

---

#### Linking to Scripts

Example: [Script Name](src/data_processing.py)

---

Example: [Read the README](./README.md)

#### Links to images or visualizations

![Correlation heatmap](../visualizations/eda/correlation_heatmap.png)

#### Links from images
[![Correlation heatmap](../visualizations/eda/correlation_heatmap.png)](../notebooks/exploratory/eda.ipynb)


Example logo: [![GitHub Logo](https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png)](https://github.com)




#### How to do:

##### Links to Project Files

- [Read the project README](../README.md)
- [Download the raw dataset](../data/raw/sales_data.csv)
- [Open the EDA notebook](../notebooks/exploratory/eda.ipynb)
- [View the data loader script](../src/data/data_loader.py)

##### Project Dashboard

- [![EDA](../visualizations/eda_icon.png)](../notebooks/exploratory/eda.ipynb)
  *Click the image above to open the Exploratory Data Analysis notebook.*

- [![Model Training](../visualizations/model_icon.png)](../notebooks/modeling/train_model.ipynb)
  *Click the image above to open the Model Training notebook.*

##### Example of Clickable Resized Image

<a href="../notebooks/exploratory/eda.ipynb">
    <img src="../visualizations/eda/correlation_heatmap.png" alt="Correlation Heatmap" style="width:300px;height:auto;"/>
</a>

*Click the image above to open the Exploratory Data Analysis notebook.*



## Introduction:

This project analyzes the file 'insurance.csv' from the Codecademy website.

**Data Sources:**

Cite: 
Any notes about the data:


### Table of Contents
 
* [Goals](#goals) <a id="linkhandle"></a>
    * [Project Formats](#formats)
    * [Data Structure](#section1_1) <a id="linkhandle"></a>
    * [Data Types](#section1_2) <a id="linkhandle"></a>
    * [Missing Values](#section1_3) <a id="linkhandle"></a>
    * [Outliers](#section1_4) <a id="linkhandle"></a>
    * [Categorical Variables](#section1_5) <a id="linkhandle"></a>
    * [Numerical Variables](#section1_6) <a id="linkhandle"></a>
    * [Summary Statistics](#section1_7) <a id="linkhandle"></a>
    * [Correlation Matrix](#section1_8) <a id="linkhandle"></a>
    * [Histograms and Boxplots](#section1_9) <a id="linkhandle"></a>
    * [Scatterplots and Correlation Matrix Heatmap](#section1_10) <a id="linkhandle"></a>


* [Data](#Data) <a id="linkhandle"></a>
    * [Loading the Data](#section1_1) <a id="linkhandle"></a>
    * [Data Information](#section1_2) <a id="linkhandle"></a>
* [Data Cleaning](#cleaning)
    * [Adding an Age Column](#section2_1)
    * [Checking the Education Variable](#section2_2)
* [Exploratory Data Analysis](#EDA)
    * [Big Picture](#section3_1)
    * [Purchasing Behavior by Income](#section3_2)
    * [More Purchasing Behavior by Income](#section3_3)
    * [Purchasing Behavior by Education and Income](#section3_4)
    * [Purchasing Behavior by Age](#section3_5)
* [Conclusion](#conclusion)
   




## Scoping:
The analysis will include a class whose functions load and organize the file so that it is readable for Python data analysis, followed by the Python data analysis itself.

From this analysis of patient info, conclusions will be drawn about the __columns__ and the relationships between them.

The project has 4 sections:
* Project Goals: define high-level objectives and intentions
* Data: does this data have enough information to do this project?
* Analysis: statistical tests and graphs
* Evaluation: conclusions and findings from our analysis



### Goals of the project:
In this project, the perspective will be that of a ___ analyst for __who hired you__. They want to ensure the ________ to maintain _______. Therefore, the main objectives as an analyst will be understanding characteristics about _____ and _____, and their relationship to ____. Some questions that are posed:

*   
    + 
    + 
    + 
    + 

### Data:
The data for this project comes from a Codecademy package. The first CSV file has information about ____ and the second ____ has information on _____.

### Analysis:
In this section, descriptive statistics and data visualization techniques will be employed to understand the data better. Statistical inference will also be used to test whether the observed values are statistically significant. Some of the key metrics that will be computed include:
* 
    1. Distributions
    2. Counts
    3. Relationship between ____ 
    4. Status of _____
    5. Observations of _____
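
The statistical inference step described above could be sketched as follows. This is only an illustration: the two groups hold made-up placeholder numbers, the 0.05 threshold is an assumed convention, and Welch's t-test is one possible choice of test, not necessarily the one this project will use.

```python
from scipy import stats

# Made-up placeholder measurements for two groups; replace with real columns.
group_a = [20.1, 21.3, 19.8, 22.0, 20.7, 21.1]
group_b = [25.2, 26.1, 24.8, 27.0, 25.5, 26.3]

# Welch's t-test compares two means without assuming equal variances.
result = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # assumed significance threshold
is_significant = result.pvalue < alpha

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
print(f"Significant at alpha = {alpha}: {is_significant}")
```

A p-value below the chosen threshold suggests the difference in group means is unlikely to be due to chance alone; it does not say how large or how important the difference is.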


### Evaluation:
Lastly, this section revisits the goals, checks the analysis, and relates the results to the questions posed. It also reflects on what has been learned in the process, notes whether any questions could not be answered and why, and records whether anyone else has reached a different conclusion or used different methodologies.


# Project Formats: <a class = "anchor" id = "formats"></a>


Standard Formats: 

Imports:
+ Use absolute import paths
+ Group imports by standard library imports, third party imports, and local imports, and alphabetize within groups

Print and Others:
+ Use f-strings rather than % or .format()
+ if condition: not if(condition)
+ Use progress bars for long-running processing
+ Use 4 spaces for indentation, not tabs
+ Keep each line of code to at most 79 characters
+ Avoid single-letter variable names
+ Strings: "" not '' unless inside another string

Naming:
+ Columns are cleaned before data visualization so that data labeled firstword_secondword becomes FirstWord SecondWord
+ Classes follow FirstSecond
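
The column-label rule above could be implemented with a small helper like the one below. This is a minimal sketch: the function name `prettify_label` and the example column names are hypothetical and only illustrate the convention, together with the f-string rule.

```python
def prettify_label(column_name: str) -> str:
    """Turn a snake_case column name into a title-cased display label."""
    words = column_name.strip().split("_")
    return " ".join(word.capitalize() for word in words if word)


# Hypothetical column names, used only for illustration.
raw_columns = ["age", "bmi_category", "smoker_status"]
labels = {name: prettify_label(name) for name in raw_columns}

for name, label in labels.items():
    print(f"{name} -> {label}")
```

Keeping the original names in the DataFrame and applying the labels only to plot titles and axes avoids breaking code that refers to the raw column names.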



Use of Functions:
filename.function()


Data Tables: <a id = "formats_data_tables"></a>
+ axis = 0 or __axis = 'index'__: apply the function down the rows, giving one result per column
+ axis = 1 or __axis = 'columns'__: apply the function across the columns, giving one result per row
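
The axis convention can be checked on a tiny table. The DataFrame below is a hypothetical example, not project data.

```python
import pandas as pd

# Hypothetical toy table: two rows and two numeric columns.
scores = pd.DataFrame({"math": [1, 2], "art": [10, 20]})

# axis=0 ("index"): collapse the rows, giving one result per column.
column_totals = scores.sum(axis=0)

# axis=1 ("columns"): collapse the columns, giving one result per row.
row_totals = scores.sum(axis=1)

print(column_totals.to_dict())
print(row_totals.tolist())
```

Note that for methods such as `drop`, the axis names where the labels live, so `scores.drop("art", axis=1)` removes a column.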

Work Flow:


Integrate with the workflow:
+ Pre-Commit Hooks: set up pre-commit hooks to run code formatters automatically before each commit.
+ Continuous Integration (CI): include code formatting checks in your CI pipelines. Reject builds if code formatting violations are detected.

Version Control and Code Reviews:
+ Commit formatted code
+ Review formatted diffs

# EDA


## Project Overview
- **Objective**: 
- **Dataset**: 

## Imports

### Standard Library Imports

In [None]:
# Standard Library Imports
import collections
import datetime
import json
import math
import os
import sys


### Third Party Library Imports

In [None]:
# Third Party Library Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns


### Local Imports

In [None]:
# Local Imports


## Data Loading and Initial Inspection 

In [None]:
# Load the data
df = pd.read_csv("data.csv")

# Shape and structure (print so every result is shown, not just the last)
print(df.shape)
print(df.head())
df.info()
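
If the file path may be wrong, the loading step can fail early with a clearer message. The helpers below are a hedged sketch: `load_csv` and `describe_structure` are hypothetical names, not part of pandas or of this project.

```python
from pathlib import Path

import pandas as pd


def load_csv(path: str) -> pd.DataFrame:
    """Read a CSV file, failing early with a clear message if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"No CSV file found at {csv_path.resolve()}")
    return pd.read_csv(csv_path)


def describe_structure(frame: pd.DataFrame) -> dict:
    """Collect the basic facts usually checked right after loading."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }
```

A typical call would be `describe_structure(load_csv("data.csv"))`, where "data.csv" is the placeholder path used in the cell above.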


## Descriptive Statistics 

In [None]:
# Summary statistics
print(df.describe())

# Categorical variable distribution
print(df["category_column"].value_counts())
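
Two small extensions of the summary above are often useful: category shares instead of raw counts, and percentiles beyond the `describe()` defaults. The toy DataFrame below is hypothetical and stands in for the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df_example = pd.DataFrame(
    {
        "category_column": ["north", "south", "north", "east", "north"],
        "numerical_column": [10.0, 12.5, 9.0, 30.0, 11.0],
    }
)

# Share of rows per category, as fractions that sum to 1.
category_share = df_example["category_column"].value_counts(normalize=True)

# Selected percentiles of the numeric column.
percentiles = df_example["numerical_column"].quantile([0.05, 0.5, 0.95])

print(category_share.round(2).to_dict())
print(percentiles.to_dict())
```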



## Missing Value Analysis

In [None]:
# Count and percentage of missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / df.shape[0]) * 100

# Visualization of missing values
sns.heatmap(df.isnull(), cbar=False)
plt.show()
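
The counts and percentages can also be combined into one table that lists only the columns with gaps. This is a sketch on hypothetical toy data; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, standing in for the real dataset.
df_example = pd.DataFrame(
    {
        "age": [25, np.nan, 40, 31],
        "region": ["north", None, None, "east"],
        "charges": [100.0, 250.0, 80.0, 120.0],
    }
)

missing_summary = pd.DataFrame(
    {
        "missing_count": df_example.isnull().sum(),
        "missing_percent": df_example.isnull().mean() * 100,
    }
).sort_values("missing_percent", ascending=False)

# Keep only the columns that actually have gaps.
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```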


## Data Cleaning

In [None]:
# Option 1: drop rows with missing values (result kept in a separate DataFrame)
df_cleaned = df.dropna()

# Option 2: impute missing values in a column with its median
df["column_name"] = df["column_name"].fillna(df["column_name"].median())
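
Median imputation only applies to numeric columns. One common approach for categorical columns, shown here as an assumption and not as this project's chosen method, is to fill with the most frequent value. The data and column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
df_example = pd.DataFrame(
    {
        "charges": [1.0, np.nan, 3.0, 100.0],
        "region": ["a", "b", None, "a"],
    }
)
df_imputed = df_example.copy()

# Numeric column: the median is robust to the extreme value 100.0.
median_value = df_imputed["charges"].median()
df_imputed["charges"] = df_imputed["charges"].fillna(median_value)

# Categorical column: fall back to the most frequent value (the mode).
mode_value = df_imputed["region"].mode().iloc[0]
df_imputed["region"] = df_imputed["region"].fillna(mode_value)

print(df_imputed)
```

Working on a copy keeps the original values available, which makes it easier to compare results before and after imputation.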


## Univariate Analysis

In [None]:
# Histogram for numerical variable
df["numerical_column"].hist(bins=30)
plt.title("Distribution of Numerical Column")
plt.show()
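
To read the numbers behind a histogram without plotting, the bin counts and edges can be computed directly with NumPy. The values below are hypothetical, and 4 bins are used only to keep the output small.

```python
import numpy as np

# Hypothetical measurements; replace with a real numeric column.
values = np.array([1.0, 2.0, 2.5, 3.0, 9.0])

# Equal-width bins spanning the minimum to the maximum value.
counts, bin_edges = np.histogram(values, bins=4)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {count}")
```

Every bin is half-open on the right except the last one, which also includes its upper edge, so the maximum value is always counted.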


## Bivariate Analysis

In [None]:
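# Correlation matrix heatmap (numeric columns only; pandas 2.x raises on
# non-numeric columns unless numeric_only=True is passed)
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()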
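
A heatmap becomes hard to read with many columns, so it can help to list the variable pairs ranked by absolute correlation. The sketch below uses hypothetical toy data and placeholder column names.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data; the column names are placeholders.
df_example = pd.DataFrame(
    {
        "height": [1.0, 2.0, 3.0, 4.0],
        "weight": [2.0, 4.0, 6.0, 8.0],  # moves in step with height
        "noise": [1.0, -1.0, 1.0, -1.0],
        "label": ["a", "b", "a", "b"],  # non-numeric, skipped by corr
    }
)

corr_matrix = df_example.corr(numeric_only=True)

# Visit each unordered pair of columns once, skipping self-correlations.
pairs = {
    (first, second): corr_matrix.loc[first, second]
    for first, second in combinations(corr_matrix.columns, 2)
}

# Rank pairs by absolute correlation, strongest first.
ranked_pairs = sorted(
    pairs.items(), key=lambda item: abs(item[1]), reverse=True
)

for (first, second), value in ranked_pairs:
    print(f"{first} vs {second}: {value:.2f}")
```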


## Outliers Detection

In [None]:
# Box plot for outlier detection
sns.boxplot(x=df["numerical_column"])
plt.title("Outliers in Numerical Column")
plt.show()
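
With default settings, the points a box plot draws beyond the whiskers are those more than 1.5 times the interquartile range (IQR) outside the quartiles. The same rule can be applied in code to list those rows. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 95], name="numerical_column")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the first and third quartiles.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = values[(values < lower_bound) | (values > upper_bound)]
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(outliers.tolist())
```

Flagged values are candidates for review, not automatic removals; whether a value is an error or a genuine extreme depends on the data.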


## Feature Engineering 

In [None]:
# Create new features
df["new_feature"] = df["numerical_column_1"] * df["numerical_column_2"]
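
Another common kind of engineered feature is a binned version of a numeric column. The sketch below assumes a hypothetical age column; the bin edges and labels are illustrative choices only.

```python
import pandas as pd

# Hypothetical ages; replace with a real numeric column.
df_example = pd.DataFrame({"age": [18, 25, 40, 67]})

# Right-inclusive bins: (0, 30], (30, 60], (60, 120]
df_example["age_group"] = pd.cut(
    df_example["age"],
    bins=[0, 30, 60, 120],
    labels=["young", "middle", "senior"],
)

print(df_example)
```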


## Conclusion and Insights

### Summary Statistics


### Key Findings
- Summary of the most important patterns found.

### Next Steps
- Potential further analysis or modeling.
