# **Creativa Data Science Bootcamp: Session 3 - Exploratory Data Analysis (EDA)**
This notebook is part of Session 3 of the Creativa Data Science Bootcamp and focuses on Exploratory Data Analysis (EDA). EDA is a critical step in data analysis: it lets us understand the underlying structure of the data, detect patterns and anomalies, and test hypotheses using summary statistics and graphical representations.


## **1. Loading and Understanding the Data**
### Loading the Dataset
Let's start by loading the dataset and taking a first look at the data.

Note: `Clean` the data further before you move on to the next steps, and `reassign` the placeholder names used in later cells (such as `column_name`) to actual columns of the dataset yourself.

In [1]:
import numpy as np
import pandas as pd

# Load the dataset
df = pd.read_csv('Coaster.csv')
df.head()

|  | coaster_name | Length | Speed | Location | Status | Opening date | Type | Manufacturer | Height restriction | Model | ... | speed1 | speed2 | speed1_value | speed1_unit | speed_mph | height_value | height_unit | height_ft | Inversions_clean | Gforce_clean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Switchback Railway | 600 ft (180 m) | 6 mph (9.7 km/h) | Coney Island | Removed | June 16, 1884 | Wood | LaMarcus Adna Thompson |  | Lift Packed | ... | 6 mph | 9.7 km/h | 6.0 | mph | 6.0 | 50.0 | ft |  | 0 | 2.9 |
| 1 | Flip Flap Railway |  |  | Sea Lion Park | Removed | 1895 | Wood | Lina Beecher |  |  | ... |  |  |  |  |  |  |  |  | 1 | 12.0 |
| 2 | Switchback Railway (Euclid Beach Park) |  |  | Cleveland, Ohio, United States | Closed |  | Other |  |  |  | ... |  |  |  |  |  |  |  |  | 0 |  |
| 3 | Loop the Loop (Coney Island) |  |  | Other | Removed | 1901 | Steel | Edwin Prescott |  |  | ... |  |  |  |  |  |  |  |  | 1 |  |
| 4 | Loop the Loop (Young's Pier) |  |  | Other | Removed | 1901 | Steel | Edwin Prescott |  |  | ... |  |  |  |  |  |  |  |  | 1 |  |
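
The raw `Length` and `Speed` columns in the preview above hold strings such as `600 ft (180 m)`, so part of the extra cleaning is turning them into numbers. The following is a minimal sketch of one way to do that, assuming each string starts with the number of interest. The helper name `leading_number` and the new column names are hypothetical, and the sample values are copied from the first row of the preview.

```python
import pandas as pd

def leading_number(text_column: pd.Series) -> pd.Series:
    """Extract the first number in each string as a float (hypothetical helper)."""
    # Capture digits with optional thousands separators and decimals, e.g. "1,200.5"
    extracted = text_column.str.extract(r"(\d[\d,]*(?:\.\d+)?)", expand=False)
    # Drop thousands separators, then convert; missing values stay NaN
    return pd.to_numeric(extracted.str.replace(",", "", regex=False), errors="coerce")

# Sample values mirror the first row of the preview, plus one missing entry
sample = pd.DataFrame({
    "Length": ["600 ft (180 m)", None],
    "Speed": ["6 mph (9.7 km/h)", None],
})
sample["length_first_value"] = leading_number(sample["Length"])
sample["speed_first_value"] = leading_number(sample["Speed"])
print(sample)
```

The first number is not guaranteed to share a unit across rows (the dataset carries separate `speed1_unit` and `height_unit` columns), so check the unit before comparing the extracted values.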


### Data Information and Summary Statistics
We will check the data types and missing values, then generate summary statistics.

In [2]:
# Display basic information about the dataset
df.info()

# Summary statistics
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1087 entries, 0 to 1086
Data columns (total 56 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   coaster_name                   1087 non-null   object 
 1   Length                         953 non-null    object 
 2   Speed                          937 non-null    object 
 3   Location                       1087 non-null   object 
 4   Status                         874 non-null    object 
 5   Opening date                   837 non-null    object 
 6   Type                           1087 non-null   object 
 7   Manufacturer                   1028 non-null   object 
 8   Height restriction             831 non-null    object 
 9   Model                          744 non-null    object 
 10  Height                         965 non-null    object 
 11  Inversions                     932 non-null    float64
 12  Lift/launch system             795 non-null    object 
 ...

|  | Inversions | year_introduced | latitude | longitude | speed1_value | speed_mph | height_value | height_ft | Inversions_clean | Gforce_clean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 932.0 | 1087.0 | 812.0 | 812.0 | 937.0 | 937.0 | 965.0 | 171.0 | 1087.0 | 362.0 |
| mean | 1.54721 | 1994.986201 | 38.373484 | -41.595373 | 53.850374 | 48.617289 | 89.575171 | 101.996491 | 1.326587 | 3.824006 |
| std | 2.114073 | 23.475248 | 15.516596 | 72.285227 | 23.385518 | 16.678031 | 136.246444 | 67.329092 | 2.030854 | 0.989998 |
| min | 0.0 | 1884.0 | -48.2617 | -123.0357 | 5.0 | 5.0 | 4.0 | 13.1 | 0.0 | 0.8 |
| 25% | 0.0 | 1989.0 | 35.03105 | -84.5522 | 40.0 | 37.3 | 44.0 | 51.8 | 0.0 | 3.4 |
| 50% | 0.0 | 2000.0 | 40.2898 | -76.6536 | 50.0 | 49.7 | 79.0 | 91.2 | 0.0 | 4.0 |
| 75% | 3.0 | 2010.0 | 44.7996 | 2.7781 | 63.0 | 58.0 | 113.0 | 131.2 | 2.0 | 4.5 |
| max | 14.0 | 2022.0 | 63.2309 | 153.4265 | 240.0 | 149.1 | 3937.0 | 377.3 | 14.0 | 12.0 |


## **2. Data Cleaning**
### Handling Missing Values
First, we will identify and handle missing values.

In [None]:
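# Check for missing values (print explicitly, since only a cell's last expression is displayed)
missing_values = df.isnull().sum()
print(missing_values)

# Handling missing values (example: filling numeric columns with their median)
# numeric_only=True is required because the dataset also contains text columns
df.fillna(df.median(numeric_only=True), inplace=True)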
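
Before choosing an imputation strategy, it can help to see what share of each column is missing. The sketch below shows this on a toy frame with made-up values (the column names are borrowed from the dataset, the rows are not), then imputes only the numeric columns with their median and leaves text columns untouched.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    "coaster_name": ["A", "B", "C", "D"],
    "speed_mph": [40.0, np.nan, 60.0, 50.0],
    "Status": ["Removed", None, "Closed", "Removed"],
})

# Share of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Impute numeric columns with their median; text columns are left as they are
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
print(toy)
```

Median imputation is only one option. For columns where most values are missing, dropping the column or keeping the gaps may be a better choice.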

### Removing Duplicates
Removing duplicate rows to ensure data quality.

In [None]:
# Remove duplicates
df = df.drop_duplicates()
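
It is worth counting duplicates before dropping them, and deciding whether "duplicate" means an identical row or a repeated key. A small sketch on made-up rows, assuming `coaster_name` is the key of interest:

```python
import pandas as pd

# Made-up rows: one exact duplicate, plus one name that appears with two locations
toy = pd.DataFrame({
    "coaster_name": ["A", "A", "B", "B"],
    "Location": ["X", "X", "Y", "Z"],
})

n_exact = toy.duplicated().sum()                            # fully identical rows
n_by_name = toy.duplicated(subset=["coaster_name"]).sum()   # repeated names
deduped = toy.drop_duplicates()                             # drops only exact duplicates

print(n_exact, n_by_name, len(deduped))
```

Rows that share a name but differ elsewhere may be distinct records, so inspect them before passing `subset=` to `drop_duplicates`.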

### Handling Outliers
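Identify and handle outliers using the IQR method: a value counts as an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.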

In [None]:
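# Identify outliers on the numeric columns only (quantiles are undefined for text)
numeric_df = df.select_dtypes(include='number')
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

# Remove rows in which any numeric column falls outside the 1.5 * IQR fences
outlier_mask = ((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)
df = df[~outlier_mask]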
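
As a worked example of the IQR rule, the sketch below plugs in the quartiles of `height_value` from the summary statistics earlier in this notebook (25% = 44.0, 75% = 113.0). Those figures were computed before any imputation, so the fences on the cleaned data may differ.

```python
# Quartiles of `height_value` as reported by df.describe()
q1, q3 = 44.0, 113.0

iqr = q3 - q1                    # 69.0
lower_fence = q1 - 1.5 * iqr     # 44.0 - 103.5
upper_fence = q3 + 1.5 * iqr     # 113.0 + 103.5
print(lower_fence, upper_fence)  # -59.5 216.5

# The reported maximum of 3937.0 lies far above the upper fence
reported_max = 3937.0
print(reported_max > upper_fence)  # True
```

Because the removal step drops a row when any single column is outside its fences, it can discard many rows at once. Comparing `len(df)` before and after is a quick sanity check.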

## **3. Exploratory Data Analysis (EDA)**
### Univariate Analysis
Analyze the distribution of individual variables using histograms and KDE plots.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Example: Distribution of a specific column
sns.histplot(df['column_name'], kde=True)
plt.show()

# KDE plot
sns.kdeplot(df['column_name'])
plt.show()

### Categorical Data Analysis
Visualize the distribution of categorical variables using bar plots and pie charts.

In [None]:
# Bar plot
sns.countplot(x='categorical_column', data=df)
plt.show()

# Pie chart
df['categorical_column'].value_counts().plot.pie(autopct='%1.1f%%')
plt.show()
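
The numbers behind a bar plot or pie chart can also be read directly as a table. A minimal sketch, using made-up `Type` values (the category labels appear in the preview, the frequencies do not come from the dataset):

```python
import pandas as pd

# Made-up sample of the `Type` column
types = pd.Series(["Wood", "Steel", "Steel", "Other", "Steel"], name="Type")

# Absolute counts and percentage shares side by side
summary = pd.DataFrame({
    "count": types.value_counts(),
    "share_pct": (types.value_counts(normalize=True) * 100).round(1),
})
print(summary)
```

A table like this is often easier to compare than pie slices when a column has many categories.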

### Bivariate Analysis
Explore relationships between two variables with scatter plots and correlation matrices.

In [None]:
# Scatter plot
sns.scatterplot(x='column1', y='column2', data=df)
plt.show()

# Correlation matrix (numeric columns only; text columns cannot be correlated)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
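
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr`. The sketch below computes one such coefficient by hand on made-up values and compares it with the pandas result. The column names are borrowed from the dataset; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values for two numeric columns
toy = pd.DataFrame({
    "speed_mph": [30.0, 45.0, 50.0, 65.0, 80.0],
    "height_ft": [40.0, 80.0, 95.0, 150.0, 200.0],
})

# Pearson r: covariance of the centered values divided by the product of their norms
x = toy["speed_mph"] - toy["speed_mph"].mean()
y = toy["height_ft"] - toy["height_ft"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# Same quantity from pandas
r_pandas = toy["speed_mph"].corr(toy["height_ft"])
print(r_manual, r_pandas)
```

Pearson correlation only measures linear association, so a scatter plot remains the better check for curved relationships.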

### Multivariate Analysis
Explore relationships among more than two variables using pair plots and PCA.

In [None]:
# Pair plot
sns.pairplot(df)
plt.show()

# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df.select_dtypes(include=[np.number]))
plt.scatter(pca_result[:, 0], pca_result[:, 1])
plt.show()
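
PCA is sensitive to the scale of each column, so a column measured in large units can dominate the first component. The sketch below is one common variant rather than the notebook's method: it standardizes synthetic data with scikit-learn's `StandardScaler` before fitting PCA and prints the share of variance each component explains. The data is randomly generated as a stand-in for three numeric columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three numeric columns on very different scales
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base * 20 + 50,                                # mid-scale column
    base * 60 + rng.normal(size=200) * 10 + 100,   # large-scale, correlated column
    rng.normal(size=200),                          # small-scale, independent column
])

# Standardize to zero mean and unit variance, then project onto two components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component
print(pca.explained_variance_ratio_)
```

If the two ratios sum to much less than 1, the 2D scatter plot hides a large part of the structure and should be read with caution.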

## **4. Advanced EDA Techniques**
### Violin and Boxen Plots
Visualize distributions and detect patterns with violin and boxen plots.

In [None]:
# Violin plot
sns.violinplot(x='categorical_column', y='continuous_column', data=df)
plt.show()

# Boxen plot
sns.boxenplot(x='categorical_column', y='continuous_column', data=df)
plt.show()
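
A grouped summary table gives the numbers that a violin or boxen plot draws. A minimal sketch with made-up speeds per `Type`:

```python
import pandas as pd

# Made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    "Type": ["Wood", "Wood", "Wood", "Steel", "Steel", "Steel", "Steel"],
    "speed_mph": [30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0],
})

# Per-category count, median, and range of the continuous column
group_stats = toy.groupby("Type")["speed_mph"].agg(["count", "median", "min", "max"])
print(group_stats)
```

The `count` column is worth checking first: a violin drawn from only a handful of points can look smoother than the data justifies.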

### Clustering Analysis
Group similar data points using K-Means clustering.

In [None]:
from sklearn.cluster import KMeans
# Fix random_state so the cluster labels are reproducible between runs
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(df.select_dtypes(include=[np.number]))
df['cluster'] = kmeans.labels_

# PCA scatter plot with clusters
sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1], hue=df['cluster'])
plt.show()
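
The choice of three clusters is a starting point, not a given. One common heuristic for picking `n_clusters` is the elbow method: fit K-Means for several values of k and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below applies it to synthetic data with three well-separated groups. It is an illustration under that assumption, not a result for the coaster dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for k = 1..6 and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 1))
```

On real data the elbow is usually less clear-cut, so it is best combined with domain knowledge about what the clusters should represent.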

### Handling Skewness
Apply a log transformation to reduce skewness in the data. `np.log1p` computes log(1 + x), so it handles zeros but requires values greater than -1.

In [None]:
# Log transformation
df['log_column'] = np.log1p(df['column_name'])
sns.histplot(df['log_column'], kde=True)
plt.show()
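
To check whether the transformation helped, compare the skewness before and after. A minimal sketch on made-up, right-skewed values:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample: mostly small values with a long upper tail
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 100], dtype=float)

raw_skew = values.skew()              # sample skewness of the raw values
log_skew = np.log1p(values).skew()    # skewness after log(1 + x)
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A skewness closer to zero indicates a more symmetric distribution. The log transformation reduces right skew here but does not remove it entirely.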

## **5. Conclusion**
This notebook walked through the main EDA steps on the given dataset: loading and inspecting the data, cleaning it (missing values, duplicates, outliers), and visualizing patterns with univariate, bivariate, and multivariate plots, followed by clustering and a log transformation for skewed columns. These steps matter in any data analysis project because the quality and reliability of the insights depend on them.