# Day 1: Exploratory Data Analysis (EDA)
Welcome to your first step in Data Science! Today, we will learn how to explore a dataset to understand its patterns, trends, and anomalies.

### What is EDA?
Exploratory Data Analysis is the process of using visual and statistical methods to investigate a dataset. It helps us:
- Identify errors and missing data.
- Understand relationships between variables.
- Formulate hypotheses for future modeling.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

In [None]:
# 1. Create Dataset
np.random.seed(42) # For reproducibility
data = {
    'Total_Bill': np.random.normal(20, 5, 200),
    'Tip': np.random.exponential(3, 200),
    'Food_Type': np.random.choice(['Pizza', 'Kabsa', 'Shawarma', 'Salad'], 200),
    'Customer_Rating': np.random.randint(1, 6, 200)
}
df = pd.DataFrame(data)

print("Dataset Shape:", df.shape)
df.head()

### Section 1: Data Inspection
Before visualizing, we need to know the 'structure' of our data.
- **Numerical Data**: Numbers that allow mathematical operations (e.g., `Total_Bill`).
- **Categorical Data**: Groups or labels (e.g., `Food_Type`).
- **Missing Values**: Empty spots that could lead to biased models if not handled.
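As a minimal sketch of how these categories can be separated in code, the snippet below splits columns by dtype with `select_dtypes`. The small `demo` frame is a hypothetical stand-in that reuses the lesson's column names; it is not the lesson dataset itself.

```python
import pandas as pd

# Hypothetical stand-in frame that reuses the lesson's column names
demo = pd.DataFrame({
    'Total_Bill': [18.5, 22.0, 19.7, 25.3],
    'Food_Type': ['Pizza', 'Kabsa', 'Salad', 'Pizza'],
    'Customer_Rating': [4, 5, 3, 2],
})

# Numerical columns: integer and float dtypes
numeric_cols = demo.select_dtypes(include='number').columns.tolist()

# Everything else is treated as categorical in this sketch
categorical_cols = demo.select_dtypes(exclude='number').columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

Note that `Customer_Rating` is stored as an integer, so it lands in the numerical group even though it could also be treated as an ordinal category.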


In [None]:
df.info()

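| Column | Mean vs. Median | Distribution Shape | Suggested Action |
|---|---|---|---|
| `Total_Bill` | Mean ≈ Median | Symmetric / Normal | Keep as is |
| `Tip` | Mean > Median | Right-skewed | Apply log transform |
| `Customer_Rating` | Mean ≈ Median | Balanced | Keep as is (or treat as ordinal) |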
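The mean-versus-median comparison can be checked numerically. The sketch below uses freshly generated samples (the column names `symmetric` and `right_skewed` and the seed are illustrative assumptions, not part of the lesson dataset) and compares the mean, the median, and pandas' sample skewness before and after a `log1p` transform.

```python
import numpy as np
import pandas as pd

# Illustrative samples: one bell-shaped, one right-skewed
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'symmetric': rng.normal(20, 5, 1000),
    'right_skewed': rng.exponential(3, 1000),
})

# Mean > median together with a positive skew suggests a right-skewed column
summary = pd.DataFrame({
    'mean': sample.mean(),
    'median': sample.median(),
    'skew': sample.skew(),
})
print(summary.round(2))

# log1p compresses the long right tail, which should pull the skew toward zero
raw_skew = sample['right_skewed'].skew()
log_skew = np.log1p(sample['right_skewed']).skew()
print("Skew before:", round(raw_skew, 2), "after log1p:", round(log_skew, 2))
```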

In [None]:
df.describe()

In [None]:
df.isnull().sum()
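The synthetic dataset above is generated without gaps, so every count should be zero. To see what handling missing values can look like, here is a minimal sketch on a hypothetical `demo` frame with gaps added by hand. Filling with the median (numerical) or the most frequent label (categorical) is one common choice among several, not the only valid one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
demo = pd.DataFrame({
    'Total_Bill': [18.5, np.nan, 25.3, 21.0],
    'Food_Type': ['Pizza', 'Kabsa', None, 'Pizza'],
})

# Count missing values per column
missing_counts = demo.isnull().sum()
print(missing_counts)

# Fill numerical gaps with the median and categorical gaps with the mode
filled = demo.copy()
filled['Total_Bill'] = filled['Total_Bill'].fillna(filled['Total_Bill'].median())
filled['Food_Type'] = filled['Food_Type'].fillna(filled['Food_Type'].mode()[0])
print(filled)
```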

### Section 2: Univariate Analysis
Analysis of **one variable** at a time.
#### Key Concepts:
- **Distribution**: How the values are spread out.
- **Skewness**: Whether the data is symmetric (as in a Normal distribution) or leans to one side.
- **Outliers**: Extreme values that deviate significantly from others (best seen in Box Plots).
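Outliers can also be flagged numerically. A minimal sketch, assuming the conventional 1.5 × IQR rule that box plots typically use for their whiskers; the `bills` values are made up for illustration.

```python
import pandas as pd

# Made-up bill amounts with one very low and one very high value
bills = pd.Series([12.0, 18.0, 19.5, 20.0, 21.0, 22.5, 24.0, 55.0], name='Total_Bill')

# Interquartile range: the spread of the middle 50% of the data
q1 = bills.quantile(0.25)
q3 = bills.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = bills[(bills < lower) | (bills > upper)]

print("Bounds:", lower, upper)
print(outliers)
```

Flagged values are candidates for a closer look, not automatic deletions; whether to keep them depends on whether they are errors or genuine extremes.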

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['Total_Bill'], kde=True, color='skyblue')
plt.title('Symmetric (Bell-Shaped) Distribution')
plt.xlabel('Total Bill')
plt.subplot(1, 2, 2)
sns.histplot(df['Tip'], kde=True, color='orange')
plt.title('Right-Skewed Distribution')
plt.xlabel('Tip Amount')

plt.tight_layout()
plt.show()

In [None]:
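# Log-transform the right-skewed Tip column into a new column,
# leaving the raw Tip values intact for the later sections
df['Tip_Log'] = np.log1p(df['Tip'])
sns.histplot(df['Tip_Log'], kde=True, color='orange')
plt.title('Tip After log1p Transform')
plt.xlabel('log(1 + Tip)')
plt.show()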

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='Food_Type', data=df, palette='viridis')
plt.title('Food Type Count')
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(x=df['Total_Bill'], color='lightgreen')
plt.title('Box Plot: Total Bill Distribution')
plt.xlabel('Total Bill ($)')
plt.show()

### Section 3: Bivariate Analysis
Analysis of **two variables** to find relationships.
- **Correlation**: A number between -1 and 1 indicating how strongly two variables move together in a straight-line (linear) fashion.
- **Categorical vs Numerical**: Seeing how groups differ (e.g., which food type has higher bills).
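To make the correlation number concrete, the sketch below computes the Pearson coefficient by hand and compares it with pandas' built-in `corr`. The `demo` values are invented so that the two columns clearly rise together. In the lesson dataset the columns are drawn independently, so correlations there should come out close to zero.

```python
import numpy as np
import pandas as pd

# Invented values where Tip tends to rise with Total_Bill
demo = pd.DataFrame({
    'Total_Bill': [10.0, 15.0, 20.0, 25.0, 30.0],
    'Tip': [1.0, 2.5, 2.0, 4.0, 4.5],
})

# Pearson r: co-movement of the centered columns, scaled by their spreads
x = demo['Total_Bill'] - demo['Total_Bill'].mean()
y = demo['Tip'] - demo['Tip'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same value from pandas (Pearson is the default method)
r_pandas = demo['Total_Bill'].corr(demo['Tip'])

print(round(r_manual, 3), round(r_pandas, 3))
```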

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Total_Bill', y='Tip', data=df, color='blue')
plt.title('Total Bill vs Tip')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Food_Type', y='Total_Bill', data=df, palette='Set2')
plt.title('Total Bill Distribution by Food Type')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(df[['Total_Bill', 'Tip', 'Customer_Rating']].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

### Section 4: Feature Engineering
Sometimes the base data isn't enough. We create new columns to find better insights.
Let's calculate the **Tip Percentage**.


In [None]:
df['Tip_Percentage'] = (df['Tip'] / df['Total_Bill']) * 100
df[['Total_Bill', 'Tip', 'Tip_Percentage']].head()


### Section 5: Data Aggregation (GroupBy)
We can group our data to compare categories. For example, which food type is most expensive on average?


In [None]:
avg_by_food = df.groupby('Food_Type')[['Total_Bill', 'Tip', 'Customer_Rating']].mean()
avg_by_food.sort_values(by='Total_Bill', ascending=False)


### Student Observations:
> *Look at the results above. Which food type seems to attract the highest tips?*


In [None]:
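# Run this cell if the ydata_profiling import at the top of the notebook failed
%pip install ydata-profiling

In [None]:
from ydata_profiling import ProfileReport  # re-import in case it was just installed

# Generate an automated EDA report and display it inside the notebook
profile = ProfileReport(df, title="Automated EDA Report", explorative=True)
profile.to_notebook_iframe()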

## Conclusion & Next Steps
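Congratulations! You've learned the basics of EDA. The workflow we followed:
1. **Inspect** the data structure.
2. **Clean** or transform variables.
3. **Visualize** distributions (Univariate).
4. **Analyze** relationships (Bivariate).
5. **Engineer** new features.
6. **Aggregate** by group to compare categories.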

### Challenge Mission 🚀
1. Filter the dataframe to show only `Pizza` eaters and calculate their average `Customer_Rating`.
2. Create a boxplot for `Tip_Percentage` to see if there are outliers.