# Group 111

Colby Todd 300241178

Engy Elsayed 300228400

# Introduction
## Goal
The overall goal of this analysis is to become familiar with exploratory data analysis (EDA) by performing it on two datasets.

## Target Audience
The target audience is people who are learning about EDA and would like to see it performed on a German Credit dataset and an F1 Drivers dataset.

# German Credit Dataset
## Description
The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes. The link to the original dataset can be found below.
 - Author: Prof. Hofmann
 - Purpose: Determining the credit risk of a person
 - Shape: 10 columns, 1000 rows
 - Features:
   - Age (Numerical)
   - Sex (Categorical)
   - Job (Categorical)
   - Housing (Categorical)
   - Saving Accounts (Categorical)
   - Checking Account (Categorical)
   - Credit Amount (Numerical)
   - Duration (Numerical)
   - Purpose (Categorical)
 - Potentially Missing Values:
   - Saving accounts only has 817 out of 1000 rows
   - Checking account only has 606 out of 1000 rows
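
The missing-value counts listed above can be checked directly with pandas. The sketch below uses a small made-up DataFrame (`toy_df` and its values are assumptions for illustration, not the real data) to show one way of counting missing entries per column.

```python
import pandas as pd

# Toy stand-in for the credit data; None marks a missing entry.
toy_df = pd.DataFrame({
    "Saving accounts": ["little", None, "rich", "moderate", None],
    "Checking account": ["little", "moderate", None, None, None],
    "Age": [22, 35, 41, 28, 53],
})

# Number of missing values in each column.
missing_counts = toy_df.isna().sum()

# Fraction of rows that are missing in each column.
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share.round(2))
```

Running the same two calls on `credit_df` should report the gaps in the two account columns, assuming the loaded data matches the description above.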

In [None]:
import pandas as pd
import scipy.stats  # import the stats submodule explicitly so scipy.stats.pearsonr always resolves
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import kagglehub
from kagglehub import KaggleDatasetAdapter

# Load German Credit Dataset into a pandas dataframe
credit_df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "uciml/german-credit/versions/1",
    "german_credit_data.csv",
)

# Map the numeric job codes to descriptive labels.
# Assigning the result back avoids in-place edits on a column selection.
credit_df['Job'] = credit_df['Job'].replace({
    0: 'Unskilled and Non Resident',
    1: 'Unskilled and Resident',
    2: 'Skilled',
    3: 'Highly Skilled',
})

credit_df.info()

## Insight 1
Most borrowers in the dataset have less than DM 5000 of credit. This implies that either it is difficult for Germans to get access to more than DM 5000 of credit, or Germans tend not to want more than DM 5000 of credit.

This is a univariate analysis using a histogram (r1).

This evidence was obtained by analyzing the right skewed histogram.

In [None]:
plt.figure(figsize=(12, 6))

# Used Reference 3
i1 = sns.histplot(credit_df['Credit amount'])

# Used Reference 2
i1.set_xlabel("Credit Amount", fontsize=14)
i1.set_ylabel("Count", fontsize=14)
i1.set_title("Counts of Credit Amount", fontsize=16)
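
The right skew read off the histogram can also be expressed as numbers. The following is a minimal sketch on hypothetical credit amounts (the `amounts` values are invented for illustration); it computes the sample skewness and the share of rows under DM 5000.

```python
import pandas as pd

# Hypothetical credit amounts in DM, chosen to be right-skewed.
amounts = pd.Series([700, 1200, 1500, 2100, 2500, 3000, 3900, 4800, 9000, 15000])

# A positive skewness indicates a long tail to the right.
skewness = amounts.skew()

# Fraction of rows with a credit amount below DM 5000.
share_below = (amounts < 5000).mean()

print(f"skewness = {skewness:.2f}")
print(f"share below DM 5000 = {share_below:.0%}")
print(f"mean = {amounts.mean():.0f}, median = {amounts.median():.0f}")
```

In a right-skewed sample the mean sits above the median, which is a quick check to pair with the plot.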

## Insight 2
About twice as many men as women have credit in this dataset. This implies men are more likely to have credit than women.

This is a univariate analysis using a countplot (r2).

This evidence was obtained by analyzing the countplot.

In [None]:
plt.figure(figsize=(12, 6))

i2 = sns.countplot(x="Sex", data=credit_df)

# Used Reference 2
i2.set_xlabel("Sex", fontsize=14)
i2.set_ylabel("Count", fontsize=14)
i2.set_title("Count of Each Sex", fontsize=16)

# Used Reference 4
for p in i2.patches:
   i2.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.25, p.get_height()+0.01))
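
The "twice as many" reading of the countplot can be backed by the raw counts. This is a small sketch on an invented `sex` column (not the real data) showing how counts, shares, and the male-to-female ratio might be computed.

```python
import pandas as pd

# Invented sample: 7 men and 3 women.
sex = pd.Series(["male"] * 7 + ["female"] * 3)

# Absolute counts and relative shares per category.
counts = sex.value_counts()
shares = sex.value_counts(normalize=True)

# Ratio of men to women in the sample.
ratio = counts["male"] / counts["female"]

print(counts)
print(shares.round(2))
print(f"male/female ratio = {ratio:.2f}")
```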

## Insight 3
There is no meaningful linear correlation between credit amount and age. This implies that the amount of debt one has isn't affected by age.

This is a bivariate analysis using a scatterplot (r6).

This evidence was obtained by analyzing the scatter plot and calculating the strength of correlation.

In [None]:
plt.figure(figsize=(12, 6))

i3 = sns.scatterplot(x='Age', y='Credit amount', data=credit_df)

# Used reference 1
# Calculate correlation coefficient between x and y
r = scipy.stats.pearsonr(x=credit_df['Age'], y=credit_df['Credit amount'])[0]

# Add correlation coefficient to plot
i3.text(30, 10, 'r = ' + str(round(r, 2)))

# Used Reference 2
i3.set_xlabel("Age", fontsize=14)
i3.set_ylabel("Credit Amount", fontsize=14)
i3.set_title("Credit Amount vs Age", fontsize=16)
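
For readers who want to see what the correlation coefficient measures, here is a minimal sketch that computes Pearson's r from its definition in plain Python. The helper name `pearson_r` and the sample lists are assumptions for illustration; the notebook itself relies on `scipy.stats.pearsonr`.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented values: a perfectly linear pair and a pair with no linear trend.
ages = [20, 30, 40, 50, 60]
linear_amounts = [2 * a + 1 for a in ages]
no_trend = pearson_r([-1, 0, 1], [1, 0, 1])

print(pearson_r(ages, linear_amounts))
print(no_trend)
```

A value near 0, as in the second call, only rules out a linear relationship; the scatterplot is still needed to spot non-linear patterns.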

## Insight 4
The largest group in the dataset is skilled workers who own their home. This implies that most people who have credit have good jobs and own their house.

This is a bivariate analysis using a countplot ordered by the count of owned houses (r5).

This evidence was obtained by analyzing the countplot comparing jobs with housing.

In [None]:
plt.figure(figsize=(12, 6))

i4 = sns.countplot(x="Job", hue="Housing", data=credit_df,
                  order=credit_df[credit_df["Housing"] == "own"]
                    .groupby("Job")
                    .size()
                    .sort_values(ascending=False)
                    .index
             )

# Used Reference 2
i4.set_xlabel("Job", fontsize=14)
i4.set_ylabel("Count", fontsize=14)
i4.set_title("Count of Type of Job and Housing", fontsize=16)

## Insight 5
Men hold larger credit amounts than women. This implies men are able to get more credit than women.

This is a bivariate analysis using a box plot (r7).

This evidence was obtained by analyzing the boxplot.

In [None]:
plt.figure(figsize=(12, 6))

i5 = sns.boxplot(x="Sex", y="Credit amount", data=credit_df)

# Used Reference 2
i5.set_xlabel("Sex", fontsize=14)
i5.set_ylabel("Credit Amount", fontsize=14)
i5.set_title("Credit Amount by Sex", fontsize=16)
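
The boxplot comparison can be summarised numerically by grouping on the categorical variable. The sketch below uses an invented `toy_df` (values are assumptions, not the real data) to compute the count, median, and mean credit amount per sex.

```python
import pandas as pd

# Invented sample of credit amounts by sex.
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female"],
    "Credit amount": [1000, 2000, 3000, 8000, 500, 1500, 2500],
})

# One row per group; the median matches the centre line of each box.
summary = toy_df.groupby("Sex")["Credit amount"].agg(["count", "median", "mean"])

print(summary)
```

Comparing medians rather than means keeps the comparison robust to the few very large credit amounts that appear as outliers in the boxplot.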

## Insight 6
There is a large disparity between the percentage of women and the percentage of men with highly skilled jobs. This implies men tend to get the highly skilled jobs over women, whether because they were able to get a better education or because they are looked upon more favourably for these jobs.

This is a bivariate analysis using a crosstab.

This evidence was obtained by analyzing the crosstab.

In [None]:
pd.crosstab(credit_df['Job'], credit_df['Sex'], margins=True)
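
Because the insight is about percentages while the crosstab above reports counts, a normalised crosstab may make the comparison easier to read. This is a sketch on an invented `toy_df` (not the real data); `normalize="columns"` is a standard `pd.crosstab` option that makes each column sum to 1.

```python
import pandas as pd

# Invented sample of job categories by sex.
toy_df = pd.DataFrame({
    "Job": ["Skilled", "Highly Skilled", "Skilled", "Skilled", "Highly Skilled", "Skilled"],
    "Sex": ["male", "male", "female", "female", "male", "male"],
})

# Share of each job category within each sex (columns sum to 1).
job_share_by_sex = pd.crosstab(toy_df["Job"], toy_df["Sex"], normalize="columns")

print(job_share_by_sex.round(2))
```

Normalising by column removes the effect of there being more men than women in the sample, so the two sexes can be compared on equal footing.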

## Insight 7
The majority of people have credit for personal reasons. This implies people with credit are using it for personal purchases rather than out of necessity.

This is a univariate analysis using a countplot (r3). The data was grouped by purpose type: personal, education, business, or repairs.

This evidence was obtained by looking at the countplot.

In [None]:
plt.figure(figsize=(12, 6))

# Winter2025-CSI4142-Week2-EDA-CaseStudy-Part2 Slide 9
# Group consumer purchases under a single 'Personal' category.
# Assigning the result back avoids in-place edits on a column selection.
personal_purposes = ['radio/TV', 'car', 'furniture/equipment',
                     'domestic appliances', 'vacation/others']
credit_df['Purpose'] = credit_df['Purpose'].replace(personal_purposes, 'Personal')
i7 = sns.countplot(x='Purpose', data=credit_df)

# Used Reference 2
i7.set_xlabel("Purpose", fontsize=14)
i7.set_ylabel("Count", fontsize=14)
i7.set_title("Count of Credit Purpose", fontsize=16)

## Insight 8
There is a moderately strong linear correlation between duration and credit amount (r = 0.62). This implies that longer durations tend to go with higher credit amounts.

This is a bivariate numerical/numerical analysis.

The evidence was obtained through assessment of the scatter plot correlation.

In [None]:
plt.figure(figsize=(12, 6))

i8 = sns.scatterplot(x='Duration', y='Credit amount', data=credit_df)

# Used reference 1
# Calculate correlation coefficient between x and y
r = scipy.stats.pearsonr(x=credit_df['Duration'], y=credit_df['Credit amount'])[0]

# Add correlation coefficient to plot
i8.text(25, 30, 'r = ' + str(round(r, 2)))

# Used Reference 2
i8.set_xlabel("Duration", fontsize=14)
i8.set_ylabel("Credit Amount", fontsize=14)
i8.set_title("Credit Amount vs Duration", fontsize=16)

## Insight 9
Women get access to credit at younger ages than men. This implies young women are able to get credit more easily than young men.

This is a bivariate categorical/numerical analysis.

The evidence was obtained through assessment of the box plot results.

In [None]:
plt.figure(figsize=(12, 6))

i9 = sns.boxplot(x="Sex", y="Age", data=credit_df)

# Used Reference 2
i9.set_xlabel("Sex", fontsize=14)
i9.set_ylabel("Age", fontsize=14)
i9.set_title("Sex vs Age of Credit", fontsize=16)

## Insight 10

We can see from this count plot that most checking account holders fall in the little or moderate categories, whereas far fewer fall in the rich category. This may imply that rich people do not like to keep their money in checking accounts.

This is a univariate analysis using a countplot.

The evidence was obtained by analyzing the countplot.

In [None]:
plt.figure(figsize=(12, 6))

i10 = sns.countplot(x='Checking account', data=credit_df)

# Used Reference 2
i10.set_xlabel("Checking Account", fontsize=14)
i10.set_ylabel("Count", fontsize=14)
i10.set_title("Count of Checking Accounts", fontsize=16)

# F1 Drivers Dataset
## Description
Whether you're a seasoned motorsport enthusiast, an aspiring data scientist, or a curious fan, this dataset opens the doors to a plethora of analytical opportunities. Explore the evolution of driver strategies, track-specific adaptations, and team dynamics that contribute to the intense competition witnessed on iconic circuits worldwide. Uncover patterns, outliers, and trends that illuminate the nuances of F1 racing, shedding light on what separates the champions from the contenders.
 - Author: Aditi Awasthi
 - Purpose: Unlock the potential for groundbreaking research, data-driven storytelling, and the thrill of extracting insights from the intricate web of statistics. "Unleashing Speed" invites you to embark on a journey through the heart-pounding realm of Formula 1, where milliseconds make the difference and the pursuit of victory is a symphony of skill, engineering, and strategy.
 - Shape: 22 columns, 868 rows
 - Features:
   - Driver (Categorical)
   - Nationality (Categorical)
   - Seasons (Categorical)
   - Championships (Numerical)
   - Race_Entries (Numerical)
   - Race_Starts (Numerical)
   - Pole_Positions (Numerical)
   - Race_Wins (Numerical)
   - Podiums (Numerical)
   - Fastest_Laps (Numerical)
   - Points (Numerical)
   - Active (Categorical)
   - Championship Years (Categorical)
   - Decade (Numerical)
   - Pole_Rate (Numerical)
   - Start_Rate (Numerical)
   - Win_Rate (Numerical)
   - Podium_Rate (Numerical)
   - FastLap_Rate (Numerical)
   - Points_Per_Entry (Numerical)
   - Years_Active (Numerical)
   - Champion (Categorical)
 - Missing Values:
   - Championship Years is potentially missing data, as it only has 34 non-null rows.

In [None]:
# Load F1 Drivers Dataset into a pandas dataframe
drivers_df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "petalme/f1-drivers-dataset/versions/2",
    "F1Drivers_Dataset.csv",
)

drivers_df.info()

## Insight 1

From this count plot, we can conclude that most Formula 1 champions came from the United Kingdom (10), whereas the rest of the countries only had up to 3 champions each. Additionally, we can see that Europe has the most champions (24), followed by South America (4), North America (3), and finally Africa (1). This implies that F1 as a sport is very European-centric, as most drivers and champions are from Europe.

This is a bivariate categorical/categorical analysis. The insights were obtained by creating and analyzing the results of a count plot comparing champions and nationality.

In [None]:
plt.figure(figsize=(12, 6))
i1 = sns.countplot(x="Nationality", data=drivers_df[drivers_df["Champion"] == True])

# Used Reference 2
i1.set_xlabel("Nationality", fontsize=14)
i1.set_ylabel("Champions", fontsize=14)
i1.set_title("Champions by Nationality", fontsize=16)

i1.tick_params(axis='x', labelsize=12)

plt.xticks(rotation=45)

plt.show()
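
The per-continent totals quoted in this insight can be derived by mapping each nationality to a continent and counting. The sketch below is illustrative only: the `champions` sample and the `continent_of` mapping are assumptions, not the dataset's actual values or spellings.

```python
import pandas as pd

# Hypothetical sample of champion nationalities; not the real dataset.
champions = pd.Series([
    "United Kingdom", "United Kingdom", "Germany", "Brazil",
    "Finland", "Argentina", "United States",
])

# Assumed nationality-to-continent lookup for the sample above.
continent_of = {
    "United Kingdom": "Europe",
    "Germany": "Europe",
    "Finland": "Europe",
    "Brazil": "South America",
    "Argentina": "South America",
    "United States": "North America",
}

# Map each nationality to its continent, then count champions per continent.
by_continent = champions.map(continent_of).value_counts()

print(by_continent)
```

Any nationality missing from the lookup would map to NaN and drop out of the counts, so the mapping should be checked against the unique values in the real column before trusting the totals.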

## Insight 2

From this count plot, we can conclude that most Formula 1 drivers have retired without ever becoming world champions. Additionally, the majority of world champions have already retired according to this plot.

This is a bivariate categorical/categorical analysis (r4). Both variables are boolean, and the results were obtained by observing the countplot.

In [None]:
plt.figure(figsize=(12, 6))
i2 = sns.countplot(x="Active", hue="Champion", data=drivers_df)

# Used Reference 2
i2.set_xlabel("Active", fontsize=14)
i2.set_ylabel("Count", fontsize=14)
i2.set_title("Active Champions", fontsize=16)

plt.show()

## Insight 3
There is very high linear correlation between Pole Positions and Race Wins (0.95). This implies that drivers who start at the front of the pack tend to win more often.

This is a bivariate numerical/numerical analysis.

We reached the conclusion by analyzing the scatterplot's correlation.

In [None]:
plt.figure(figsize=(12, 6))

i3 = sns.scatterplot(x="Pole_Positions", y="Race_Wins", data=drivers_df)

# Used reference 1
# Calculate correlation coefficient between x and y
r = scipy.stats.pearsonr(x=drivers_df["Pole_Positions"], y=drivers_df["Race_Wins"])[0]

# Add correlation coefficient to plot
i3.text(5, 30, "r = " + str(round(r, 2)))

# Used Reference 2
i3.set_xlabel("Pole Positions", fontsize=14)
i3.set_ylabel("Race Wins", fontsize=14)
i3.set_title("Pole Positions vs Race Wins", fontsize=16)

## Insight 4
It is rare for a champion to have a career shorter than 7 years. This implies it takes experience to win a championship.

This is a bivariate numerical/numerical analysis.

We reached the conclusion by analyzing the boxplot.

In [None]:
plt.figure(figsize=(12, 6))
i4 = sns.boxplot(x="Championships", y="Years_Active", data=drivers_df)

# Used Reference 2
i4.set_xlabel("Championships", fontsize=14)
i4.set_ylabel("Years Active", fontsize=14)
i4.set_title("Championships vs Career Longevity", fontsize=16)
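
The claim about career length among champions can be checked with a simple filter. This is a minimal sketch on an invented `toy_df` (values are assumptions, not the real data); it finds the shortest career among champions and the share of champions with fewer than 7 years active.

```python
import pandas as pd

# Invented sample of drivers.
toy_df = pd.DataFrame({
    "Championships": [0, 0, 0, 1, 1, 2, 4],
    "Years_Active": [1, 2, 15, 6, 10, 12, 16],
})

# Keep only drivers with at least one championship.
champions = toy_df[toy_df["Championships"] > 0]

# Shortest career among champions, and share with under 7 years active.
shortest_career = champions["Years_Active"].min()
short_share = (champions["Years_Active"] < 7).mean()

print(f"shortest champion career: {shortest_career} years")
print(f"share of champions under 7 years: {short_share:.0%}")
```

Note that `Years_Active` measures total career length, not the time taken to win a first title, so this check supports the insight only indirectly.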

## Insight 5

There is very high linear correlation between Championships and Race Wins (0.92). This implies that drivers who win more races tend to win more world championships.

This is a bivariate numerical/numerical analysis.

We reached the conclusion by analyzing the scatterplot's correlation.

In [None]:
plt.figure(figsize=(12, 6))

i5 = sns.scatterplot(x="Championships", y="Race_Wins", data=drivers_df)

# Used reference 1
# Calculate correlation coefficient between x and y
r = scipy.stats.pearsonr(x=drivers_df["Championships"], y=drivers_df["Race_Wins"])[0]

# Add correlation coefficient to plot
i5.text(5, 30, "r = " + str(round(r, 2)))

# Used Reference 2
i5.set_xlabel("Championships", fontsize=14)
i5.set_ylabel("Race Wins", fontsize=14)
i5.set_title("Championships vs Race Wins", fontsize=16)

## Insight 6

There is a moderately strong linear correlation between Race Starts and Race Wins (r = 0.6). This implies that drivers with more experience tend to win more races, possibly because they have more training.

This is a bivariate numerical/numerical analysis.

We reached the conclusion by analyzing the scatterplot's correlation.

In [None]:
plt.figure(figsize=(12, 6))

i6 = sns.scatterplot(x="Race_Starts", y="Race_Wins", data=drivers_df)

# Used reference 1
# Calculate correlation coefficient between x and y
r = scipy.stats.pearsonr(x=drivers_df["Race_Starts"], y=drivers_df["Race_Wins"])[0]

# Add correlation coefficient to plot
i6.text(5, 30, "r = " + str(round(r, 2)))

# Used Reference 2
i6.set_xlabel("Race Starts", fontsize=14)
i6.set_ylabel("Race Wins", fontsize=14)
i6.set_title("Race Starts vs Race Wins", fontsize=16)

## Insight 7

A fastest lap means the driver set the fastest lap of the entire race. There is a moderately strong linear correlation between Fast Lap Rate and Points Per Entry (r = 0.59). This implies that drivers who set fastest laps (which have earned an additional point in some seasons) tend, on average, to score more points per race.

This is a bivariate numerical/numerical analysis.

We reached the conclusion by analyzing the scatterplot's correlation.

In [None]:
plt.figure(figsize=(12, 6))

i7 = sns.scatterplot(x="FastLap_Rate", y="Points_Per_Entry", data=drivers_df)

# Used reference 1
# Calculate correlation coefficient between x and y
r = scipy.stats.pearsonr(x=drivers_df["FastLap_Rate"], y=drivers_df["Points_Per_Entry"])[0]

# Add correlation coefficient to plot
i7.text(0, 6, "r = " + str(round(r, 2)))

# Used Reference 2
i7.set_xlabel("Fast Lap Rate", fontsize=14)
i7.set_ylabel("Points Per Entry", fontsize=14)
i7.set_title("Fast Lap Rate vs Points Per Entry", fontsize=16)

## Insight 8

This boxplot shows that active drivers typically have more points per entry. This makes sense because the points system changed in 2010 to a much higher scale (e.g. 25 points for a win compared to 10). Additionally, for much of the sport's history only the top 6 finishers were awarded points, compared to 10 now. The median points per entry for inactive drivers is 0.

This is a bivariate categorical/numerical analysis.

We reached the conclusion by analyzing the median points per race for active vs inactive drivers.

In [None]:
plt.figure(figsize=(12, 6))
i8 = sns.boxplot(x="Active", y="Points_Per_Entry", data=drivers_df)

# Used Reference 2
i8.set_xlabel("Active", fontsize=14)
i8.set_ylabel("Points Per Entry", fontsize=14)
i8.set_title("Active vs Points Per Entry", fontsize=16)
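
To illustrate why the change of points scale inflates points per entry, the sketch below scores the same hypothetical finishing positions under the 1991-2002 scale (10-6-4-3-2-1) and the scale used since 2010 (25-18-15-12-10-8-6-4-2-1). The `finishes` list and the helper `points_per_entry` are assumptions for illustration, and bonus points such as those for fastest laps are ignored.

```python
# Points awarded by finishing position (index 0 is first place).
POINTS_1991_2002 = [10, 6, 4, 3, 2, 1]
POINTS_2010_ON = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]

def points_per_entry(finishes, scale):
    """Average points per race for 1-based finishing positions under a scale."""
    total = sum(scale[pos - 1] for pos in finishes if pos <= len(scale))
    return total / len(finishes)

# Hypothetical results for one driver over five races.
finishes = [1, 3, 7, 10, 12]

old_average = points_per_entry(finishes, POINTS_1991_2002)
new_average = points_per_entry(finishes, POINTS_2010_ON)

print(f"1991-2002 scale: {old_average:.1f} points per entry")
print(f"2010 onward scale: {new_average:.1f} points per entry")
```

The identical results score several times higher under the newer scale, which is why comparing raw points per entry across eras favours active drivers.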

## Insight 9

We can see that most F1 drivers will start and end their entire career without ever winning a world championship, and very few can win more than one. This implies that winning an F1 world championship is very difficult and not obtained by many drivers.

This is a univariate numerical analysis. We got the insight by observing the count plot and comparing how many drivers have won championships with how many have not won any.

In [None]:
plt.figure(figsize=(12, 6))
i9 = sns.countplot(x=drivers_df["Championships"])

# Used Reference 2
i9.set_xlabel("Championships", fontsize=14)
i9.set_ylabel("Count", fontsize=14)
i9.set_title("Championships Count", fontsize=16)

# Used Reference 4
for p in i9.patches:
   i9.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.25, p.get_height()+0.01))

## Insight 10

We can see that the most common F1 career length is 1 year, and very few drivers continue beyond 10-11 years. This implies that F1 is a very cutthroat sport that does not give drivers the opportunity to continue if they do not perform well in their first year.

This is univariate numerical analysis, and the insight was found by analyzing the count plot and viewing how long every driver has been active.

In [None]:
plt.figure(figsize=(12, 6))
i10 = sns.countplot(x=drivers_df["Years_Active"])

# Used Reference 2
i10.set_xlabel("Years Active", fontsize=14)
i10.set_ylabel("Count", fontsize=14)
i10.set_title("Count of Years Active", fontsize=16)

# Used Reference 4
for p in i10.patches:
   i10.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.25, p.get_height()+0.01))
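
A countplot shows the most frequent value, which is not necessarily the mean. As a sketch of the distinction, the snippet below computes the mode, median, and mean of an invented list of career lengths (`years_active` is an assumption, not the real column).

```python
import statistics

# Invented career lengths in years, with many one-season drivers.
years_active = [1, 1, 1, 1, 2, 2, 3, 5, 8, 12, 17]

most_common = statistics.mode(years_active)    # tallest bar in a countplot
middle = statistics.median(years_active)       # half of drivers at or below this
average = statistics.mean(years_active)        # pulled up by long careers

print(f"mode = {most_common}, median = {middle}, mean = {average:.2f}")
```

In a sample like this, a few very long careers pull the mean well above the mode, so it helps to state which measure a claim about the "typical" career refers to.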

# Conclusion
Overall we learnt a lot about the different relationships in the data from the two datasets, and we learnt what the data is able to tell us.

# References

1. https://www.statology.org/seaborn-scatterplot-with-correlation-coefficient/
2. https://www.w3schools.com/python/matplotlib_labels.asp
3. https://seaborn.pydata.org/generated/seaborn.histplot.html
4. https://www.tutorialspoint.com/matplotlib-how-to-show-the-count-values-on-the-top-of-a-bar-in-a-countplot
