---

# Exploratory Data Analysis (EDA)

**Exploratory Data Analysis (EDA)** is a crucial step in the data science process. It involves analyzing datasets to summarize their main characteristics and gain insights before applying more complex modeling techniques. EDA helps in understanding the data, detecting anomalies, discovering patterns, and testing hypotheses.

## Key Goals of EDA:
1. **Understand the structure of the data**: Identify data types, variable distributions, and relationships.
2. **Identify data quality issues**: Detect missing values, outliers, and anomalies.
3. **Spot patterns and relationships**: Find correlations and trends that may guide modeling.
4. **Make data-driven hypotheses**: Generate hypotheses based on insights and trends from the data.

## Key Activities in EDA:

### 1. Descriptive Statistics
- Calculate basic metrics such as mean, median, mode, standard deviation, skewness, and kurtosis.
- Together, these metrics give an overall summary of the data distribution.

```python
# Example in Python using pandas
df.describe()
```
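
`describe()` reports count, mean, standard deviation, and quartiles, but not mode, skewness, or kurtosis. A minimal sketch of computing those separately follows; the Series values are made up for illustration and do not come from any dataset.

```python
import pandas as pd

# Illustrative values only; the single large value makes the data right-skewed
values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 30], name="example")

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],  # mode() returns a Series; take the first mode
    "std": values.std(),            # sample standard deviation (ddof=1)
    "skewness": values.skew(),      # positive values indicate a longer right tail
    "kurtosis": values.kurt(),      # excess kurtosis (about 0 for a normal distribution)
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Comparing the mean with the median is a quick check here: a mean well above the median usually points to a right-skewed column.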

### 2. Data Visualization
Visualizations help in uncovering trends, patterns, and relationships that may not be apparent from raw numbers.
- **Histograms**: Show the distribution of numerical variables.
- **Box plots**: Highlight outliers and the spread of the data.
- **Scatter plots**: Examine relationships between two numerical variables.
- **Correlation matrices**: Assess the strength and direction of relationships between features.

```python
# Pair plot with seaborn: scatter plots for every pair of numeric columns
import seaborn as sns
sns.pairplot(df)
```
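
The individual plot types listed above can also be drawn directly with matplotlib. The sketch below uses synthetic data; the `demo` frame and its `x` and `y` columns are hypothetical stand-ins for real numeric columns.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
demo = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(0, 5, 200)})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(demo["x"], bins=20)             # distribution of one variable
axes[0].set_title("Histogram of x")

axes[1].boxplot(demo["x"])                   # spread and potential outliers
axes[1].set_title("Box plot of x")

axes[2].scatter(demo["x"], demo["y"], s=10)  # relationship between two variables
axes[2].set_title("x vs. y")

fig.tight_layout()
```

In a notebook the figure is typically rendered at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed.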

### 3. Missing Data Analysis
- Check for missing or null values.
- Explore the patterns of missing data and how they might affect the analysis or model performance.
- Decide on how to handle missing data (e.g., imputation, removal).

```python
# Checking for missing values
df.isnull().sum()
```
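
Beyond raw counts, it often helps to look at the share of missing values and to compare removal with imputation. This is a sketch on a small made-up frame; the column names and the choice of median and mode imputation are assumptions, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 2.0, 5.0],
    "c": ["x", "y", None, "x"],
})

# Share of missing values per column, as a percentage
missing_pct = demo.isnull().mean() * 100
print(missing_pct)

# Option 1: removal. Drop every row that has at least one missing value.
dropped = demo.dropna()

# Option 2: imputation. Median for numeric columns, most frequent value for text.
imputed = demo.copy()
imputed["a"] = imputed["a"].fillna(imputed["a"].median())
imputed["b"] = imputed["b"].fillna(imputed["b"].median())
imputed["c"] = imputed["c"].fillna(imputed["c"].mode().iloc[0])
```

Removal leaves only one of the four rows here, which illustrates how quickly `dropna()` can shrink a dataset when missing values are spread across columns.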

### 4. Outlier Detection
- Outliers are extreme values that may distort the results of the analysis. They can be detected using visual tools like box plots or statistical methods like Z-scores.

```python
# Using a box plot to detect outliers in one column
# (pass the data by keyword; positional use differs between seaborn versions)
sns.boxplot(x=df['column'])
```
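
The statistical route mentioned above can be sketched as follows. The data are made up, and the thresholds (3 standard deviations, 1.5 times the IQR) are common conventions rather than fixed rules.

```python
import pandas as pd

# Twenty ordinary readings plus one extreme value (illustrative data)
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95])

# Z-score rule: flag points far from the mean, measured in standard deviations
z_threshold = 3.0
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > z_threshold]

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles,
# the same convention box plots usually use to draw their whiskers
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers.tolist(), iqr_outliers.tolist())
```

The Z-score rule relies on the mean and standard deviation, which are themselves pulled by extreme values, so on small samples the IQR rule tends to be the more robust of the two.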

### 5. Feature Relationships
- Examine how variables interact with each other. Techniques like **correlation matrices** or **scatter plots** help reveal relationships and dependencies.
- Categorical variables can be analyzed with **grouped bar plots** or **pivot tables**.

```python
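# Correlation matrix to find relationships between numerical features
# (numeric_only=True skips text columns, which pandas 2.0+ would otherwise reject)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)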
```
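
For categorical variables, a pivot table or cross-tabulation plays the role that the correlation matrix plays for numeric ones. A minimal sketch, assuming a hypothetical frame with two categorical columns (`weather`, `month`) and one numeric column (`temp`):

```python
import pandas as pd

demo = pd.DataFrame({
    "weather": ["S", "C", "S", "PS", "C", "S"],
    "month":   [5, 5, 6, 6, 6, 5],
    "temp":    [67, 72, 80, 78, 74, 62],
})

# Mean of a numeric column for each combination of two categorical columns
mean_temp = demo.pivot_table(values="temp", index="weather", columns="month", aggfunc="mean")
print(mean_temp)

# How often each combination of categories occurs
counts = pd.crosstab(demo["weather"], demo["month"])
print(counts)
```

Combinations that never occur show up as `NaN` in the pivot table and as `0` in the cross-tabulation.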

### 6. Hypothesis Testing
- EDA often involves testing initial hypotheses or assumptions about the data, guiding further modeling efforts.
- For example, you might test if certain features significantly impact the target variable (e.g., through t-tests, ANOVA, or chi-square tests).
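
As one illustration of such a test, the sketch below compares the means of two groups with Welch's t-test from SciPy. The groups are synthetic, and the 0.05 significance level is a conventional choice, not something prescribed by the data.

```python
import numpy as np
from scipy import stats

# Two synthetic groups whose true means differ by 10
rng = np.random.default_rng(42)
group_a = rng.normal(loc=70, scale=5, size=50)
group_b = rng.normal(loc=80, scale=5, size=50)

# Welch's t-test: compares two means without assuming equal variances
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4g}")

alpha = 0.05
if result.pvalue < alpha:
    print("Reject the null hypothesis of equal means")
else:
    print("No evidence against equal means at this significance level")
```

For more than two groups, `stats.f_oneway` (ANOVA) would be the analogous call, and `stats.chi2_contingency` covers the case of two categorical variables.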

## Importance of EDA:
- **Informs Feature Selection**: Helps identify important features or redundant variables to focus on during the modeling phase.
- **Data Quality Check**: Detects issues like missing data, outliers, or data entry errors.
- **Guides Modeling**: Helps determine the most appropriate algorithms and techniques based on the nature of the data (e.g., the distribution of variables or relationships between them).
- **Reduces Complexity**: Early insights can simplify the dataset, making it more manageable and interpretable.

## Example Workflow of EDA:
1. **Start with Descriptive Statistics**: Summarize the basic metrics of your data.
2. **Visualize the Data**: Use plots to uncover hidden patterns and relationships.
3. **Handle Missing Values**: Address any missing or incomplete data.
4. **Identify Outliers**: Detect and handle extreme values that could skew analysis.
5. **Analyze Relationships**: Study correlations and associations between variables.

## Tools for EDA:
- **Python**:
  - Libraries: `pandas`, `matplotlib`, `seaborn`, `plotly`
- **R**:
  - Libraries: `ggplot2`, `dplyr` (both part of the `tidyverse`)

---

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 


In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/aishwaryamate/Datasets/refs/heads/main/data_clean.csv", index_col=0)
df


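| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|
| 1 | 41.0 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | S |
| 2 | 36.0 | 118.0 | 8.0 | 5 | 2 | 2010 | 72 | C |
| 3 | 12.0 | 149.0 | 12.6 | 5 | 3 | 2010 | 74 | PS |
| 4 | 18.0 | 313.0 | 11.5 | 5 | 4 | 2010 | 62 | S |
| 5 | NaN | NaN | 14.3 | 5 | 5 | 2010 | 56 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154 | 41.0 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | C |
| 155 | 30.0 | 193.0 | 6.9 | 9 | 26 | 2010 | 70 | PS |
| 156 | NaN | 145.0 | 13.2 | 9 | 27 | 2010 | 77 | S |
| 157 | 14.0 | 191.0 | 14.3 | 9 | 28 | 2010 | 75 | S |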


In [4]:
df.describe()


| | Ozone | Solar.R | Wind | Day | Year | Temp |
|---|---|---|---|---|---|---|
| count | 120.0 | 151.0 | 158.0 | 158.0 | 158.0 | 158.0 |
| mean | 41.583333 | 185.403974 | 9.957595 | 16.006329 | 2010.0 | 77.727848 |
| std | 32.620709 | 88.723103 | 3.511261 | 8.997166 | 0.0 | 9.377877 |
| min | 1.0 | 7.0 | 1.7 | 1.0 | 2010.0 | 56.0 |
| 25% | 18.0 | 119.0 | 7.4 | 8.0 | 2010.0 | 72.0 |
| 50% | 30.5 | 197.0 | 9.7 | 16.0 | 2010.0 | 78.5 |
| 75% | 61.5 | 257.0 | 11.875 | 24.0 | 2010.0 | 84.0 |
| max | 168.0 | 334.0 | 20.7 | 31.0 | 2010.0 | 97.0 |
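
By default `describe()` summarizes only numeric columns, which is why text columns such as Weather are missing from the output above. A sketch of including them, on a small hypothetical frame that borrows the column names `Temp` and `Weather`:

```python
import pandas as pd

demo = pd.DataFrame({
    "Temp": [67, 72, 74, 62],
    "Weather": ["S", "C", "PS", "S"],
})

# include="all" adds count, unique, top (most frequent value) and freq for text columns
full_summary = demo.describe(include="all")
print(full_summary)

# value_counts() gives the full frequency table for a single categorical column
weather_counts = demo["Weather"].value_counts()
print(weather_counts)
```

Rows that do not apply to a column (for example `mean` for a text column) are filled with `NaN` in the combined summary.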


In [5]:
df.dtypes


Ozone      float64
Solar.R    float64
Wind       float64
Month       object
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [6]:
df["Month"].unique()


array(['5', 'May', '6', '7', '8', '9'], dtype=object)

In [7]:
# Assign the result back to the column. Calling replace(..., inplace=True) on
# df["Month"] acts on an intermediate copy, and pandas warns that this pattern
# will stop working in pandas 3.0.
df["Month"] = df["Month"].replace("May", 5)


In [8]:
# astype() returns a new Series; df is unchanged until the result is assigned back
df["Month"].astype(int)


1      5
2      5
3      5
4      5
5      5
      ..
154    5
155    9
156    9
157    9
158    9
Name: Month, Length: 158, dtype: int64

In [9]:
df['Month'] = df['Month'].astype(int)


In [10]:
df.dtypes


Ozone      float64
Solar.R    float64
Wind       float64
Month        int64
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object
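
The same cleanup can be sketched in a more general form that handles any month name rather than only "May". `MONTH_NUMBERS` and `clean_month` are hypothetical helpers, and the sketch assumes English month names (the default locale for Python's `calendar` module) mixed with digit strings.

```python
import calendar

import pandas as pd

# Map full and abbreviated English month names to their numbers, e.g. "May" -> 5
MONTH_NUMBERS = {name: i for i, name in enumerate(calendar.month_name) if name}
MONTH_NUMBERS.update({name: i for i, name in enumerate(calendar.month_abbr) if name})


def clean_month(column: pd.Series) -> pd.Series:
    """Convert a column of month names and digit strings to integers."""
    replaced = column.map(lambda value: MONTH_NUMBERS.get(value, value))
    # pd.to_numeric raises if any value is still not a number
    return pd.to_numeric(replaced).astype(int)


raw = pd.Series(["5", "May", "6", "7", "8", "9"])
print(clean_month(raw).tolist())
```

Letting `pd.to_numeric` raise on unexpected values is a deliberate choice here: a silent fallback would hide entries such as misspelled month names.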

## Duplicates

In [11]:
df.duplicated()


1      False
2      False
3      False
4      False
5      False
       ...  
154    False
155    False
156    False
157     True
158    False
Length: 158, dtype: bool

In [12]:
df.duplicated().sum()


np.int64(1)

In [13]:
df.drop_duplicates(inplace=True)


In [14]:
df.shape


(157, 8)
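
Before dropping duplicates it can be useful to look at every copy of a repeated row, not only the later occurrence that `duplicated()` flags by default. A minimal sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({
    "Ozone": [41.0, 36.0, 41.0, 14.0],
    "Temp": [67, 72, 67, 75],
})

# keep=False marks every member of a duplicate group, including the first one
all_copies = demo[demo.duplicated(keep=False)]
print(all_copies)

# drop_duplicates() keeps the first occurrence of each group by default
deduplicated = demo.drop_duplicates()
print(deduplicated.shape)
```

Passing `subset=[...]` to either method restricts the comparison to selected columns, which helps when an ID column makes otherwise identical rows look distinct.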

In [15]:
df.head()


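| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|
| 1 | 41.0 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | S |
| 2 | 36.0 | 118.0 | 8.0 | 5 | 2 | 2010 | 72 | C |
| 3 | 12.0 | 149.0 | 12.6 | 5 | 3 | 2010 | 74 | PS |
| 4 | 18.0 | 313.0 | 11.5 | 5 | 4 | 2010 | 62 | S |
| 5 | NaN | NaN | 14.3 | 5 | 5 | 2010 | 56 | S |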


In [16]:
# Year is 2010 in every row (std = 0 in the describe() output), so it adds no information
df.drop(columns=["Year"], inplace=True)
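
Constant columns like this one can also be found programmatically instead of being read off the summary table. A sketch on a small hypothetical frame:

```python
import pandas as pd

demo = pd.DataFrame({
    "Year": [2010, 2010, 2010],
    "Temp": [67, 72, 74],
})

# A column with a single distinct value cannot help distinguish rows
constant_columns = [col for col in demo.columns if demo[col].nunique(dropna=False) == 1]
print(constant_columns)

reduced = demo.drop(columns=constant_columns)
print(reduced.columns.tolist())
```

Using `dropna=False` counts missing values as their own category, so a column that is partly one value and partly `NaN` is not treated as constant.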
