# <h1 align="center">Exploratory Data Analysis</h1>
<h3 align="center">Analyzing and visualizing data in AWS</h3>

## Pre-Reqs

### What you need to know before starting EDA

1. **Python** - Used throughout these notes, but the test will not test your Python knowledge.
2. **scikit-learn** - Python library for machine learning models.

#### Types of Data:
- Numerical
- Categorical
- Ordinal

---

### Data Distributions


#### 1. A Probability Density Function (PDF) 

<h4 align="center">Represents the probability distribution of a continuous random variable. The area under the curve over a given range gives the probability that the variable falls within that range.</h4>
<img src="../Figures/EDA/PDF.png" style="width:400px; display:block; margin:auto;">
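
As a hedged illustration of how a PDF turns into probabilities, the sketch below uses a standard normal distribution (an assumption; these notes do not name a specific distribution). The height of the curve at one point is a density, not a probability. The probability of a range is the area under the curve over that range.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x (a curve height, not a probability)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Probability that X falls between -1 and 1 = area under the PDF on that range
prob_within_one_sigma = normal_cdf(1.0) - normal_cdf(-1.0)

print(f"density at 0:  {normal_pdf(0.0):.4f}")         # about 0.3989
print(f"P(-1 < X < 1): {prob_within_one_sigma:.4f}")   # about 0.6827
```
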


#### 2. A Probability Mass Function (PMF) 
<h4 align="center">Gives the exact probability of a discrete random variable taking on a specific value. Not continuous.</h4>
<img src="../Figures/EDA/PMF.png" style="width:400px; display:block; margin:auto;">


#### The **Poisson distribution** is another example of a PMF because it works with discrete counts. We can't sell half a car, so a PMF is the kind of distribution we use for data such as cars sold.
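
A minimal sketch of a Poisson PMF, assuming a hypothetical average of 3 cars sold per day (the rate is made up for illustration). Each whole-number count gets its own exact probability, and the probabilities over all counts add up to 1.

```python
import math

def poisson_pmf(k, rate):
    """P(X = k) when events occur at an average of `rate` per interval."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

average_cars_per_day = 3  # hypothetical rate, chosen only for illustration

# Probability of selling exactly 0, 1, ..., 5 cars in a day
for cars_sold in range(6):
    print(cars_sold, round(poisson_pmf(cars_sold, average_cars_per_day), 4))

# Summing over (practically) all possible counts gives 1
total_probability = sum(poisson_pmf(k, average_cars_per_day) for k in range(50))
print(round(total_probability, 6))
```
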

---

### Seasonality (repeating patterns over time)

<h2 align="center">Seasonality is a pattern in the dataset that repeats over a fixed period.</h2>
<img src="../Figures/EDA/Seasonality.png" style="width:400px; display:block; margin:auto;">

### Trend (long-term movement)

<h2 align="center">Trend is the overall upward or downward movement across the entire dataset.</h2>
<p align="center">In this example, the trend is upward.</p>
<img src="../Figures/EDA/Trends.png" style="width:400px; display:block; margin:auto;">

### Dataset - Seasonality = Trend

<h3 align="center">If we estimate the seasonality and subtract it from the dataset, what remains is the smooth trend line of the data (plus any noise).</h3>
<p align="center">In this example, we subtract the seasonality from the dataset to get the smooth trend curve.</p>
<img src="../Figures/EDA/TrendSeason.png" style="width:400px; display:block; margin:auto;">
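
The subtraction above can be sketched in plain Python. This is one simple way to do it, not the only one. It assumes an additive series with a weekly cycle (period of 7), a made-up linear trend, and no noise. All values are hypothetical.

```python
def centered_moving_average(values, window):
    """Average over a full window centered on each point (interior points only)."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]

period = 7                                   # assumed weekly cycle
seasonal_pattern = [-3, -1, 0, 1, 2, 4, -3]  # made-up daily effects that sum to zero
true_trend = [100 + 0.5 * day for day in range(28)]
series = [true_trend[day] + seasonal_pattern[day % period] for day in range(28)]

# 1. Smooth over one full period so the seasonal effects cancel out
half = period // 2
smoothed = centered_moving_average(series, period)  # covers days half .. 27 - half

# 2. Estimate each weekday's seasonal effect as its average gap to the smoothed line
gaps_by_weekday = [[] for _ in range(period)]
for offset, level in enumerate(smoothed):
    day = offset + half
    gaps_by_weekday[day % period].append(series[day] - level)
seasonality = [sum(gaps) / len(gaps) for gaps in gaps_by_weekday]

# 3. Dataset - Seasonality = Trend
recovered_trend = [series[day] - seasonality[day % period] for day in range(len(series))]

print([round(value, 2) for value in recovered_trend[:5]])
```

With real, noisy data the recovered line would still contain the noise, so it would be less smooth than in this idealized example.
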

### Time Series Models

Time series data consists of **trend**, **seasonality**, and **noise**. 

- Use an **additive model** when seasonal effects stay constant over time (e.g., sales increase by a fixed amount each year).
- Use a **multiplicative model** when seasonality scales with the trend (e.g., sales double during holidays as the business grows).
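
To make the difference concrete, here is a small sketch with made-up quarterly numbers. In the additive series the seasonal swing stays the same size every year. In the multiplicative series it grows as the trend grows.

```python
quarters_per_year = 4
years = 4
trend = [100 + 10 * t for t in range(years * quarters_per_year)]  # steadily growing baseline

additive_effect = [-20, 0, 5, 15]               # fixed amounts added each quarter
multiplicative_effect = [0.8, 1.0, 1.05, 1.15]  # factors applied to the trend each quarter

additive_series = [trend[t] + additive_effect[t % quarters_per_year]
                   for t in range(len(trend))]
multiplicative_series = [trend[t] * multiplicative_effect[t % quarters_per_year]
                         for t in range(len(trend))]

def seasonal_swing(series, baseline, year):
    """Largest minus smallest deviation from the trend within one year."""
    start = year * quarters_per_year
    deviations = [series[t] - baseline[t] for t in range(start, start + quarters_per_year)]
    return max(deviations) - min(deviations)

for year in (0, years - 1):
    print(f"year {year}: additive swing = {seasonal_swing(additive_series, trend, year):.1f}, "
          f"multiplicative swing = {seasonal_swing(multiplicative_series, trend, year):.1f}")
```
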


---

## AWS Athena (Serverless)(Near-RealTime)


AWS Athena is a serverless, interactive query service that allows you to run SQL queries on data stored in Amazon S3 without needing to set up or manage databases or servers.

**Use Case:** when you have large datasets in S3 and need quick SQL queries without managing a database.

 **Cost Breakdown 💰**
- **Pay-as-you-go**: No upfront costs, pay only for queries run.
- **$5 per TB scanned**: Charges are based on the amount of data scanned by queries.

 **Ways to Reduce Costs 💡**
- **Use columnar storage formats** (e.g., **ORC, Parquet**).
  - Queries read only the columns they need, which reduces the data scanned and improves query performance.
- **Partition and compress data** so that queries scan fewer bytes.

**Note:** **AWS Glue** (for data cataloging) and **Amazon S3** (for data storage) have their **own separate charges**.

🔹 **Tip:** Optimize your data storage strategy to minimize the data scanned and reduce costs!
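
A rough cost calculation, using the $5 per TB rate quoted above. The table sizes, compression ratio, and column counts are hypothetical, and the sketch assumes a binary terabyte and ignores any billing minimums or rounding rules.

```python
PRICE_PER_TB_USD = 5.0    # rate quoted in these notes
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte

def athena_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimated cost of one query from the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

# Hypothetical 2 TB CSV table: every query scans all of it
csv_bytes_scanned = 2 * BYTES_PER_TB

# Hypothetical: the same data as compressed Parquet is 4x smaller,
# and the query reads only 2 of 10 equally sized columns
parquet_bytes_scanned = csv_bytes_scanned / 4 * (2 / 10)

csv_cost = athena_query_cost(csv_bytes_scanned)
parquet_cost = athena_query_cost(parquet_bytes_scanned)

print(f"CSV full scan:        ${csv_cost:.2f}")
print(f"Parquet, two columns: ${parquet_cost:.2f}")
```
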


---

## AWS QuickSight (Serverless)(Visualization of Data)

### An application that helps generate visualizations and business insights from large datasets efficiently.

Machine Learning Capabilities:
- Anomaly detection
- Forecasting
- Auto-narratives

### QuickSight Visual Types

AutoGraph: based on the properties of the data, QuickSight generates a graph it considers suitable, but its choice is not always right.

Bar Charts
- For comparison and distribution (histograms)

Line graphs
- For changes over time

Scatter plots, heat maps
- For correlation

Pie graphs, tree maps
- For aggregation

Pivot tables
- For tabular data

### **Know which chart fits which kind of data; this might be a question on the test**
### Example: Based on this data what is the best visualization for it? 

---

## AWS Elastic Map Reduce (EMR)

**Amazon EMR (Elastic MapReduce)** is a managed big data platform that allows you to run Apache Spark, Hadoop, Presto, Hive, and other big data frameworks at scale on AWS. It is designed for processing large-scale datasets across a distributed computing environment.

This makes it easier to **prepare, normalize, and scale** **massive datasets** before feeding them into machine learning algorithms.

### 💡 Spark is usually preferred over Hadoop MapReduce for new projects because it processes data in memory, which makes it faster, and it supports near-real-time stream processing.

### EMR Cluster Architecture & Workflow

<h4 align="center">This is the default EMR architecture. Core and task nodes can be added to scale the cluster out, and you can also run multiple separate EMR clusters for different tasks.</h4>
<img src="../Figures/EDA/EMRArch.png" style="width:600px; display:block; margin:auto;">


---

### Spark MLlib

#### **What is Spark MLlib?**
A key advantage of Spark MLlib is that it enables **common ML algorithms to run on distributed computing clusters**. In contrast, libraries like **Scikit-Learn** are designed for single-node execution and do not natively support distributed computing.

#### **🚀 Machine Learning Algorithms in Spark MLlib**
Spark MLlib provides a variety of machine learning models optimized for large-scale data processing:

- **Classification**: Logistic Regression, Naïve Bayes  
- **Regression**: Linear and Non-Linear Regression Models  
- **Decision Trees**: Random Forest, Gradient-Boosted Trees  
- **Recommendation Systems**: Alternating Least Squares (ALS)  
- **Clustering**: K-Means  
- **Topic Modeling**: Latent Dirichlet Allocation (LDA)  
- **ML Workflow Utilities**: Pipelines, Feature Transformation, Model Persistence  
- **Dimensionality Reduction**: Singular Value Decomposition (SVD), Principal Component Analysis (PCA)  
- **Statistics**: Summary Statistics, Hypothesis Testing  

💡 **Key Takeaway**: If you need scalable machine learning for **big data**, Spark MLlib is a better choice than traditional libraries like **Scikit-Learn**.


---

### Workflow for EMR Model Training

#### ✅ Spark on Core Nodes → Handles distributed data preprocessing & ETL across multiple machines.

#### ✅ PyTorch on Task Nodes → Performs matrix multiplications & model training using available hardware (CPU or GPU).


---

## Feature Engineering

### **What is Feature Engineering?**
Feature engineering is the process of **transforming raw data into meaningful inputs** that improve machine learning model performance. It involves **selecting, creating, and modifying features** to highlight patterns in data. Common techniques include **normalization, one-hot encoding, feature scaling, dimensionality reduction (PCA), and interaction terms**. Good feature engineering enhances model accuracy by making important patterns more detectable. 🚀

**For example:** Say we have a person. Their features could include height, age, income, education, etc.
There are hundreds or even thousands of features we could create. Feature engineering is deciding which ones matter most and, if we don't have enough features, working out how to create more.

---

### **Curse of Dimensionality**
**The Curse of Dimensionality** refers to the problem where adding more features (dimensions) makes data sparser, increasing computation time and reducing model performance. Every feature we add moves the data into a higher-dimensional space. The volume of that space grows much faster than the number of data points, so the points end up farther apart and the data becomes sparse. As a result, models struggle to find meaningful patterns, leading to overfitting and poor generalization.
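
A small experiment can illustrate the sparsity effect. The sketch below places the same number of random points in unit hypercubes of increasing dimension (a made-up setup, seeded for repeatability) and measures how far apart they are on average.

```python
import math
import random

def mean_pairwise_distance(num_points, dimensions, seed=0):
    """Average Euclidean distance between random points in a unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dimensions)] for _ in range(num_points)]
    distances = [math.dist(p, q)
                 for i, p in enumerate(points)
                 for q in points[i + 1:]]
    return sum(distances) / len(distances)

# Same 50 points' worth of data, spread over more and more dimensions
for dims in (2, 10, 100, 1000):
    print(dims, round(mean_pairwise_distance(50, dims), 2))
```

The average distance keeps growing with the number of dimensions, so a fixed amount of data covers the space more and more thinly.
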

---

### **Common Feature Engineering Techniques**

#### One-Hot Encoding
One-hot encoding is a method used to convert categorical variables into a binary matrix. Our dataset has some columns filled with words such as gender, education, and cities.

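```python
import pandas as pd

# For illustration, the three category labels themselves are the values being encoded
features = ['Education', 'City', 'Gender']

# Perform one-hot encoding: one column per label, with a 1 where the row matches it
df_encoded = pd.get_dummies(features, dtype=int)

print(df_encoded.to_string(index=False))
```

```text
 City  Education  Gender
    0          1       0
    1          0       0
    0          0       1
```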


Here, **One-Hot Encoding** transforms categorical labels into **binary vectors**:  
- **Education** → `010`  
- **City** → `100`  
- **Gender** → `001`  

Each category is represented as a **unique binary vector**. 🚀
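
In practice the encoding is applied to the columns of a dataset rather than to a list of labels. A minimal sketch, assuming a hypothetical `people` DataFrame with two text columns and one numeric column:

```python
import pandas as pd

# Hypothetical dataset: two categorical columns and one numeric column
people = pd.DataFrame({
    "City": ["Austin", "Boston", "Austin"],
    "Education": ["BS", "MS", "PhD"],
    "Age": [25, 32, 47],
})

# Encode only the categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(people, columns=["City", "Education"], dtype=int)

print(encoded.to_string(index=False))
```

Each original text column becomes one new column per distinct value (for example `City_Austin` and `City_Boston`), and every row has exactly one `1` within each group.
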

---

#### Feature Scaling
Feature scaling is crucial when numerical features have different scales. Scaling puts all features on a comparable range so that no single feature dominates the others. It can also speed up computation, for example by helping gradient-based optimizers converge.

For example, ‘Age’ and ‘Income’ have very different scales: ‘Age’ might range from 0 to 100, while ‘Income’ could range from 20,000 to 200,000.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

data = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
                     'Income': [50000, 60000, 75000, 100000, 90000]})

# Standardize each column to mean 0 and standard deviation 1
Scaled_data = scaler.fit_transform(data)

# Print results
print("Original Data:")
print(data)

print("\nScaled Data:")
print(Scaled_data)
```

```text
Original Data:
   Age  Income
0   25   50000
1   30   60000
2   35   75000
3   40  100000
4   45   90000

Scaled Data:
[[-1.41421356 -1.35581536]
 [-0.70710678 -0.81348922]
 [ 0.          0.        ]
 [ 0.70710678  1.35581536]
 [ 1.41421356  0.81348922]]
```


From the output above, we can see that 'Age' and 'Income' have each been scaled to a mean of 0 and a standard deviation of 1. This puts them on a comparable scale for modeling and helps algorithms that rely on distances or gradients perform better.
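
What the scaler computes for each column is the z-score, `(value - mean) / standard deviation`, using the population standard deviation. A hand-rolled sketch for the 'Age' column, using only the standard library (the `standardize` helper is hypothetical, written just for this illustration):

```python
import statistics

ages = [25, 30, 35, 40, 45]

def standardize(values):
    """Z-score each value: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (divides by n)
    return [(value - mean) / std for value in values]

scaled_ages = standardize(ages)
print([round(value, 4) for value in scaled_ages])
# [-1.4142, -0.7071, 0.0, 0.7071, 1.4142]
```
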

---

#### Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features while preserving important information. Techniques like **PCA (Principal Component Analysis)** transform data into fewer dimensions by finding the most **variance-preserving** directions, while **t-SNE (t-Distributed Stochastic Neighbor Embedding)** is used for **visualizing high-dimensional data** by mapping it to 2D or 3D space. This helps improve model efficiency, reduce overfitting, and handle the **Curse of Dimensionality**. 🚀

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample data (5 features)
data = np.array([[2, 8, 4, 6, 10],
                 [1, 9, 3, 5, 8],
                 [3, 7, 5, 7, 12],
                 [4, 6, 6, 8, 14]])

# Standardize the data (PCA works best when data is scaled)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Apply PCA to reduce dimensions to 2
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)

# Convert to DataFrame for better readability
df_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2'])
print(df_pca)
```

```text
   PC1           PC2
0  1.0  1.913261e-16
1  3.0 -4.527166e-17
2 -1.0 -8.030383e-17
3 -3.0  4.527166e-17
```


**Conclusion:**  
This dataset varies almost entirely along **one dimension** (PC1): the PC2 values are on the order of 1e-16, which is effectively zero. PC1 therefore captures nearly all of the data's structure and **PC2 contributes very little**, so reducing the dataset to **only PC1** may be sufficient.  

**🔥 Why Keep Only PC1?**  
✔ **Simplifies the data** 🚀  
✔ **Speeds up ML model training** 🏋️‍♂️  
✔ **Eliminates unnecessary/noisy dimensions** 🎯  
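
One way to check that conclusion is to measure how much of the total variance each component explains. The sketch below does this with NumPy's SVD on the same sample data, as a hand-rolled stand-in for what a PCA library reports. It is an illustration, not the library's exact implementation.

```python
import numpy as np

data = np.array([[2, 8, 4, 6, 10],
                 [1, 9, 3, 5, 8],
                 [3, 7, 5, 7, 12],
                 [4, 6, 6, 8, 14]], dtype=float)

# Standardize each column (zero mean, unit variance)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Squared singular values are proportional to the variance along each component
_, singular_values, _ = np.linalg.svd(standardized, full_matrices=False)
explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

print(np.round(explained_variance_ratio, 4))
```

The first ratio comes out at approximately 1 and the rest at approximately 0, because all five features in this sample are exact linear functions of one another.
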

---

### What to do with Missing data

We can replace missing values with the mean or median, or we can drop the entire feature. These are naive approaches that often perform poorly in real applications.

- A better approach is KNN imputation, which fills a gap using the closest rows in feature space. It works best with numerical data.
- For missing categorical data, we can use deep learning to predict the values.
- Regression can find linear and non-linear relationships between the feature with gaps and the other features. The most advanced technique of this kind is MICE (Multiple Imputation by Chained Equations).
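
The KNN idea from the list above can be sketched in plain Python. This is a simplified illustration, not a library implementation. The `knn_impute` helper and the data are hypothetical, and the sketch assumes the other features in each row are present and on comparable scales.

```python
import math

def knn_impute(rows, target_index, k=2):
    """Fill None entries in one column with the mean of the k nearest complete rows."""
    complete_rows = [row for row in rows if row[target_index] is not None]
    filled = []
    for row in rows:
        if row[target_index] is not None:
            filled.append(list(row))
            continue

        def distance(other):
            # Euclidean distance over every feature except the missing one
            return math.sqrt(sum((row[j] - other[j]) ** 2
                                 for j in range(len(row)) if j != target_index))

        neighbors = sorted(complete_rows, key=distance)[:k]
        estimate = sum(neighbor[target_index] for neighbor in neighbors) / k
        new_row = list(row)
        new_row[target_index] = estimate
        filled.append(new_row)
    return filled

# Columns: age, years of experience, salary in thousands (last salary is missing)
rows = [[25, 2, 50], [30, 6, 70], [35, 8, 80], [40, 10, 90], [31, 6, None]]

imputed_rows = knn_impute(rows, target_index=2, k=2)
print(imputed_rows[-1])  # [31, 6, 75.0]
```

The two nearest rows are `[30, 6, 70]` and `[35, 8, 80]`, so the missing salary is filled with their average.
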

### **MICE** is considered the best approach if getting more data is not feasible

#### **MICE Full Breakdown**

🚀 What is MICE?

MICE (**Multiple Imputation by Chained Equations**) is a smart way to **fill in missing values** in a dataset. Instead of just using an average or a guess, MICE **predicts the missing values** using patterns in the existing data.

**MICE is not ideal for large datasets because it is computationally expensive. For large datasets, deep learning is usually the better choice.**

---

#### Pseudocode


 **1️⃣ First, Make an Initial Guess 🤔**
- Imagine you have a table of students' test scores with some missing values:

  | Student | Math | Science | English |
  |---------|------|---------|---------|
  | Alice   | 90   | **78 (guess)**   | 85      |
  | Bob     | **87 (guess)** | 75      | 80      |
  | Charlie | 85   | 80      | **82 (guess)**   |

- MICE **fills in missing values with a rough guess** (e.g., the average of the known values in the same column).

---

**2️⃣ Train a Model for Each Column with Missing Values 📊**
- Now, MICE takes **one column at a time** and builds a mini **prediction model** that predicts that column from the other columns, trained on the rows where its value is known.
- Example: If **Alice’s Science score is missing**, MICE **uses her Math & English scores** to predict Science.

---

**3️⃣ Predict Missing Values and Update the Table 🔄**
- MICE **replaces the guessed values with better predictions** from the model.

  | Student | Math | Science | English |
  |---------|------|---------|---------|
  | Alice   | 90   | **79 (predicted)** | 85  |
  | Bob     | **87 (guess)** | 75 | 80  |
  | Charlie | 85   | 80      | **82 (guess)** |

---

**4️⃣ Repeat Until Everything is Stable 🔁**
- MICE **keeps improving the predictions** by repeating this process **multiple times** until the values **don’t change much**.
- When this happens, we have a **complete dataset with no missing values**! 🎉
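
The four steps above can be sketched in plain Python for the simplest case of two columns. This is a simplified, deterministic illustration of the chained-regression loop. Full MICE also adds randomness and produces several imputed datasets. The helper names and the scores are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def chained_imputation(columns, iterations=10):
    """Fill None entries in two columns by regressing each on the other in turn."""
    cols = [list(col) for col in columns]
    missing = [[value is None for value in col] for col in cols]

    # Step 1: initial guess = mean of the known values in each column
    for col, is_missing in zip(cols, missing):
        known = [value for value in col if value is not None]
        guess = sum(known) / len(known)
        for i, gap in enumerate(is_missing):
            if gap:
                col[i] = guess

    # Steps 2-4: one regression per column, repeated for a fixed number of rounds
    for _ in range(iterations):
        for target in (0, 1):
            predictor = 1 - target
            known_rows = [i for i, gap in enumerate(missing[target]) if not gap]
            slope, intercept = fit_line([cols[predictor][i] for i in known_rows],
                                        [cols[target][i] for i in known_rows])
            # Replace only the originally missing entries with the new predictions
            for i, gap in enumerate(missing[target]):
                if gap:
                    cols[target][i] = slope * cols[predictor][i] + intercept
    return cols

# Hypothetical scores in which Science is always 10 points below Math
math_scores = [90, None, 85, 70, 60]
science_scores = [80, 75, None, 60, 50]

filled_math, filled_science = chained_imputation([math_scores, science_scores])
print(round(filled_math[1], 2), round(filled_science[2], 2))
```

The filled values settle close to 85 for the missing Math score and 75 for the missing Science score, which matches the pattern in the known rows. Simple mean imputation would have left them at 76.25 and 66.25.
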

---

#### 🔥 Why is MICE Better Than Just Using an Average?
✔ **Finds patterns in data** instead of filling every gap with the same fixed value.  
✔ **More accurate than mean/median imputation** because it considers relationships between variables.  
✔ **Works well even when multiple features are missing.**  

---

#### Code Example

```python
# Import necessary libraries
from sklearn.experimental import enable_iterative_imputer  # Enable MICE
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np

# Sample data with missing values
data = {
    'Age': [25, np.nan, 30, 35, 40],
    'Salary': [50000, 60000, np.nan, 80000, 90000],
    'Experience': [2, 4, 6, np.nan, 10]
}

df = pd.DataFrame(data)  # Create DataFrame

print(f"Before MICE:\n{df}")

# Apply MICE to fill missing values
imputer = IterativeImputer()  # Use default model (BayesianRidge)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

df_imputed = df_imputed.round(2)  # Round to 2 decimal places
df_imputed = df_imputed.astype(int)  # Convert to whole numbers (truncates any remaining decimals)

print("\nAfter MICE (Rounded & No Decimals):\n", df_imputed)
```

```text
Before MICE:
    Age   Salary  Experience
0  25.0  50000.0         2.0
1   NaN  60000.0         4.0
2  30.0      NaN         6.0
3  35.0  80000.0         NaN
4  40.0  90000.0        10.0

After MICE (Rounded & No Decimals):
    Age  Salary  Experience
0   25   50000           2
1   27   60000           4
2   30   69946           6
3   35   80000           8
4   40   90000          10
```


---

### What to do with Unbalanced Data