# <h1 align="center">Exploratory Data Analysis</h1>
<h3 align="center">Analyzing and visualizing data in AWS</h3>

## Pre-Reqs

### What you need to know before starting EDA

1. **Python** - the test will not assess your Python knowledge directly.
2. **Scikit-Learn** - a Python library for machine learning models.

#### Types of Data:
- Numerical
- Categorical
- Ordinal

### Data Distributions


#### 1. A Probability Density Function (PDF) 

<h4 align="center">Represents the probability distribution of a continuous random variable. The curve gives the density at each value; the probability that the variable falls within a given range is the area under the curve over that range.</h4>
<img src="../Figures/EDA/PDF.png" style="width:400px; display:block; margin:auto;">
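
To make the difference between density and probability concrete, here is a minimal sketch using `statistics.NormalDist` from the Python standard library. The height figures (mean 170 cm, standard deviation 10 cm) are invented for illustration.

```python
from statistics import NormalDist

# Hypothetical example: heights modeled as a normal distribution (mean 170 cm, sd 10 cm).
heights = NormalDist(mu=170, sigma=10)

# Density at a single point. This is NOT a probability on its own.
density_at_mean = heights.pdf(170)

# Probability of falling inside a range = area under the PDF = CDF(b) - CDF(a).
p_within_one_sd = heights.cdf(180) - heights.cdf(160)

print(f"density at 170 cm: {density_at_mean:.4f}")
print(f"P(160 <= height <= 180): {p_within_one_sd:.4f}")
```

The range probability comes out near 0.68, the familiar "one standard deviation" share of a normal distribution.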


#### 2. A Probability Mass Function (PMF) 
<h4 align="center">Gives the exact probability of a discrete random variable taking on a specific value. It applies to discrete values, not continuous ones.</h4>
<img src="../Figures/EDA/PMF.png" style="width:400px; display:block; margin:auto;">


#### The **Poisson Distribution** is an example of a PMF because it works with discrete values. We can't sell half a car, so a PMF is the kind of distribution to use for counts such as cars sold.
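
As a rough illustration of a PMF, the sketch below evaluates the Poisson formula P(X = k) = e^(-λ) λ^k / k! with the standard library only. The dealership and its average of 3 cars per day are hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson-distributed count with average rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical dealership that sells 3 cars per day on average.
avg_cars_per_day = 3.0
for cars in range(6):
    print(f"P({cars} cars) = {poisson_pmf(cars, avg_cars_per_day):.3f}")

# With a PMF, the probabilities of the individual values sum to 1.
total = sum(poisson_pmf(k, avg_cars_per_day) for k in range(50))
```

Each value of `k` is a whole number of cars, which is why a PMF (and not a PDF) is the right tool here.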

### Seasonality (repeating patterns over time)

<h2 align="center">Seasonality is a pattern that repeats over a fixed period within the dataset.</h2>
<img src="../Figures/EDA/Seasonality.png" style="width:400px; display:block; margin:auto;">

### Trend (long-term movement)

<h2 align="center">A trend is the overall upward or downward movement across the entire dataset.</h2>
<p align="center">In this example the trend is upward.</p>
<img src="../Figures/EDA/Trends.png" style="width:400px; display:block; margin:auto;">

### Dataset - Seasonality = Trend

<h3 align="center">If we estimate the seasonality and subtract it from the dataset, what remains is the smooth trend line of the data (plus any noise).</h3>
<p align="center">In this example we subtract the seasonality from the dataset to get the smooth trend curve.</p>
<img src="../Figures/EDA/TrendSeason.png" style="width:400px; display:block; margin:auto;">
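
A minimal sketch of this subtraction, using only the Python standard library. The synthetic series, the 7-day period, and the helper name `estimate_seasonality` are assumptions made for illustration; real data also contains noise, and a tested decomposition routine from a statistics library is usually the better choice in practice.

```python
from statistics import mean

PERIOD = 7  # assumed weekly pattern in daily data

# Synthetic data: an upward linear trend plus a repeating weekly pattern that sums to zero.
weekly_pattern = [-3, -2, -1, 0, 1, 2, 3]
series = [10 + 0.5 * day + weekly_pattern[day % PERIOD] for day in range(28)]

def estimate_seasonality(values, period):
    """Average deviation from a centered moving average, per position in the cycle."""
    half = period // 2  # the period is assumed odd so the window can be centered
    deviations = [[] for _ in range(period)]
    for i in range(half, len(values) - half):
        moving_avg = mean(values[i - half : i + half + 1])
        deviations[i % period].append(values[i] - moving_avg)
    return [mean(d) for d in deviations]

seasonality = estimate_seasonality(series, PERIOD)

# Dataset - Seasonality = Trend (this toy series has no noise).
trend = [value - seasonality[i % PERIOD] for i, value in enumerate(series)]
```

Because the toy series is noise-free and its trend is linear, `trend` recovers the straight line `10 + 0.5 * day`; on real data the result would only approximate the trend.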

### Time Series Models

Time series data consists of **trend**, **seasonality**, and **noise**. 

- Use an **additive model** when seasonal effects stay constant over time (e.g., sales increase by a fixed amount each year).
- Use a **multiplicative model** when seasonality scales with the trend (e.g., sales double during holidays as the business grows).
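
The two models can be compared with a small sketch. All numbers here (the baseline of 100, growth of 5 per month, a swing of 20 units or 20%) are made up, and noise is left out to keep the comparison readable.

```python
import math

def seasonal_factor(month: int) -> float:
    """Hypothetical yearly cycle, ranging from -1 to 1 and repeating every 12 months."""
    return math.sin(2 * math.pi * month / 12)

months = range(36)
trend = [100 + 5 * m for m in months]  # hypothetical steadily growing baseline

# Additive: series = trend + seasonality. The seasonal swing is a fixed amount (+/- 20).
additive = [t + 20 * seasonal_factor(m) for m, t in zip(months, trend)]

# Multiplicative: series = trend * seasonal factor. The swing is a fixed percentage
# (+/- 20%), so it grows in absolute terms as the trend grows.
multiplicative = [t * (1 + 0.2 * seasonal_factor(m)) for m, t in zip(months, trend)]

def swing(series, trend_values, year):
    """Largest absolute deviation from the trend within the given year (0-based)."""
    start = year * 12
    pairs = zip(series[start:start + 12], trend_values[start:start + 12])
    return max(abs(s - t) for s, t in pairs)

print("additive swing, year 1 vs year 3:", swing(additive, trend, 0), swing(additive, trend, 2))
print("multiplicative swing, year 1 vs year 3:", swing(multiplicative, trend, 0), swing(multiplicative, trend, 2))
```

The additive swing stays the same from year to year, while the multiplicative swing widens as the baseline rises.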


## AWS Athena (Serverless)(Near-RealTime)


AWS Athena is a serverless, interactive query service that allows you to run SQL queries on data stored in Amazon S3 without needing to set up or manage databases or servers.

**Use Case:** Use Athena when you have large datasets in S3 and need quick SQL queries without managing a database.

 **Cost Breakdown 💰**
- **Pay-as-you-go**: No upfront costs, pay only for queries run.
- **$5 per TB scanned**: Charges are based on the amount of data scanned by queries.

 **Ways to Reduce Costs 💡**
- **Use columnar storage formats** (e.g., **ORC, Parquet**).
  - Queries read only the columns they need, which reduces the data scanned and improves query performance.
- **Partition and compress data** so that each query scans fewer bytes.

**Note:** **AWS Glue** (for data cataloging) and **Amazon S3** (for data storage) have their **own separate charges**.

🔹 **Tip:** Optimize your data storage strategy to minimize the data scanned and reduce costs!
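
The pricing arithmetic can be sketched as follows. The $5 per TB rate is the figure quoted above; the 2 TB table and the 5% scan ratio are hypothetical, the function name `athena_query_cost` is invented, and the sketch assumes 1 TB = 1024^4 bytes while ignoring any per-query minimum charge.

```python
PRICE_PER_TB_USD = 5.00  # rate quoted in these notes; actual pricing may vary by region

def athena_query_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimated query cost in USD, based only on the volume of data scanned."""
    terabytes = bytes_scanned / 1024**4  # assumption: 1 TB = 1024^4 bytes
    return terabytes * price_per_tb

# Hypothetical table: 2 TB of raw CSV, scanned in full by one query.
csv_bytes = 2 * 1024**4
full_scan = athena_query_cost(csv_bytes)

# Assumed effect of Parquet + compression + partition pruning: only 5% of the bytes are read.
# The real ratio depends on the data and on the query.
optimized_scan = athena_query_cost(int(csv_bytes * 0.05))

print(f"full CSV scan:  ${full_scan:.2f}")
print(f"optimized scan: ${optimized_scan:.2f}")
```

Since the charge depends on bytes scanned, any layout change that lets a query skip data lowers the bill proportionally.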


## AWS QuickSight (Serverless)(Visualization of Data)

### An application that helps generate visualizations and business insights from large datasets efficiently.

Machine Learning Capabilities:
- Anomaly detection
- Forecasting
- Auto-narratives

### QuickSight Visual Types

AutoGraph: QuickSight picks a chart type automatically based on the properties of the data, but the choice is not always the right one.

Bar Charts
- For comparison and distribution (histograms)

Line graphs
- For changes over time

Scatter plots, heat maps
- For correlation

Pie graphs, tree maps
- For aggregation

Pivot tables
- For tabular data

### **Know which chart fits which kind of data; this might be a question on the test**
### Example: Based on this data, what is the best visualization for it?
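
One way to memorize the chart guidance above is to treat it as a lookup table. This is only a study aid written in plain Python; the dictionary keys and the function name `suggest_charts` are invented here and have nothing to do with the QuickSight API.

```python
# Purpose -> chart types, as listed in the notes above.
CHART_FOR_PURPOSE = {
    "comparison": ["bar chart"],
    "distribution": ["bar chart (histogram)"],
    "change over time": ["line graph"],
    "correlation": ["scatter plot", "heat map"],
    "aggregation": ["pie graph", "tree map"],
    "tabular": ["pivot table"],
}

def suggest_charts(purpose: str) -> list[str]:
    """Return the chart types suggested for an analysis purpose (empty list if unknown)."""
    return CHART_FOR_PURPOSE.get(purpose.lower(), [])

print(suggest_charts("correlation"))
print(suggest_charts("change over time"))
```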

## AWS Elastic MapReduce (EMR)

**Amazon EMR (Elastic MapReduce)** is a managed big data platform that allows you to run Apache Spark, Hadoop, Presto, Hive, and other big data frameworks at scale on AWS. It is designed for processing large-scale datasets across a distributed computing environment.

This makes it easier to **prepare, normalize, and scale** **massive datasets** before feeding them into machine learning algorithms.

### 💡 Spark is usually preferred over Hadoop MapReduce for new projects because it is generally faster (it processes data in memory) and supports near-real-time stream processing.

### EMR Cluster Architecture & Workflow

<h4 align="center">Represents the default architecture of EMR. Core and task nodes can be added to scale the cluster, and you can run multiple EMR clusters for different tasks.</h4>
<img src="../Figures/EDA/EMRArch.png" style="width:600px; display:block; margin:auto;">


### Spark MLlib

### **What is Spark MLlib?**
A key advantage of Spark MLlib is that it enables **common ML algorithms to run on distributed computing clusters**. In contrast, libraries like **Scikit-Learn** are designed for single-node execution and do not natively support distributed computing.

### **🚀 Machine Learning Algorithms in Spark MLlib**
Spark MLlib provides a variety of machine learning models optimized for large-scale data processing:

- **Classification**: Logistic Regression, Naïve Bayes  
- **Regression**: Linear and Non-Linear Regression Models  
- **Decision Trees**: Random Forest, Gradient-Boosted Trees  
- **Recommendation Systems**: Alternating Least Squares (ALS)  
- **Clustering**: K-Means  
- **Topic Modeling**: Latent Dirichlet Allocation (LDA)  
- **ML Workflow Utilities**: Pipelines, Feature Transformation, Model Persistence  
- **Dimensionality Reduction**: Singular Value Decomposition (SVD), Principal Component Analysis (PCA)  
- **Statistics**: Summary Statistics, Hypothesis Testing  

💡 **Key Takeaway**: If you need scalable machine learning for **big data**, Spark MLlib is a better choice than traditional libraries like **Scikit-Learn**.


### Workflow for EMR Model Training

#### ✅ Spark on Core Nodes → Handles distributed data preprocessing & ETL across multiple machines.

#### ✅ PyTorch on Task Nodes → Performs matrix multiplications & model training using available hardware (CPU or GPU).


## Feature Engineering