# Datasets Introduction

### Dataset 1: IMDB Movies Dataset
This dataset contains the top 1000 movies from IMDb, including numerical information such as ratings, votes, and gross revenue, along with categorical and text-based attributes like title, genre, director, and cast. 

### Dataset 2: AI Impact on Jobs 2030
This dataset models how different professions might be affected by AI-driven automation by 2030. It includes variables such as salary, experience, education level, automation probability, and multiple skill indicators.

### Dataset 3: Car Price Analysis Dataset
This dataset contains 2,500 car listings with a mix of numerical and categorical variables, including price, mileage, year, engine size, brand, fuel type, and condition.

# Conclusion

## Why I selected Dataset 3 (car_price_prediction)

**Reasons for choosing the Car dataset**
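- A balanced mix of numerical variables (price, mileage, year, engine size) and categorical variables (brand, fuel type, transmission, condition).
- No missing values and a clean structure.
- A larger sample than the IMDB dataset (2,500 rows vs. 1,000), for more reliable insights.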
- Real-world context makes results easier to interpret.

---

## Why I did not select the other datasets

**Dataset 1: IMDB Top 1000 Movies**
- Too many text-only columns.
- Fewer meaningful numerical variables.
- Requires more cleaning (year, gross, genres).

**Dataset 2: AI_Impact_on_Jobs_2030**
- Distributions look artificial and overly symmetric.
- Several fields appear unrealistic.
- Skill columns lack clear meaning and are hard to interpret.


## Early idea for analysis direction

- Explore which factors (mileage, year, engine size, condition) most strongly influence car price.
- Compare depreciation patterns across different brands.
- Check linear relationships using scatter plots and correlations.
- Use grouping and visualization to see how price varies across brands and conditions.
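
The correlation step in this plan can be sketched as follows. This is a minimal illustration, not the final analysis: it uses only the five rows that `head()` prints for the car dataset as a stand-in for the full file, and the helper name `rank_price_correlations` is hypothetical. Correlations computed from five rows carry no statistical weight; the point is the shape of the computation.

```python
import pandas as pd

# Stand-in sample: the five rows printed by head() for the car dataset.
cars = pd.DataFrame({
    "Year": [2016, 2018, 2013, 2011, 2009],
    "Engine Size": [2.3, 4.4, 4.5, 4.1, 2.6],
    "Mileage": [114832, 143190, 181601, 68682, 223009],
    "Price": [26613.92, 14679.61, 44402.61, 86374.33, 73577.10],
})

def rank_price_correlations(df: pd.DataFrame, target: str = "Price") -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Order by absolute value so strong negative relationships rank high too.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

print(rank_price_correlations(cars))
```

Ranking by absolute value keeps a strong negative relationship (for example, between mileage and price) from being buried below weak positive ones.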


# Dataset Exploration

## Dataset 1: IMDB Top 1000 Movies

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### 1. Load Dataset

In [2]:
df1 = pd.read_csv('datasets/imdb_top_1000.csv')

### 2. Structure & Overview

In [3]:
df1.head()

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB


In [5]:
df1.describe()

| | IMDB_Rating | Meta_score | No_of_Votes |
|---|---|---|---|
| count | 1000.0 | 843.0 | 1000.0 |
| mean | 7.9493 | 77.97153 | 273692.9 |
| std | 0.275491 | 12.376099 | 327372.7 |
| min | 7.6 | 28.0 | 25088.0 |
| 25% | 7.7 | 70.0 | 55526.25 |
| 50% | 7.9 | 79.0 | 138548.5 |
| 75% | 8.1 | 87.0 | 374161.2 |
| max | 9.3 | 100.0 | 2343110.0 |


### 3. Missing Values

In [6]:
df1.isnull().sum()

Poster_Link        0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64
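
The counts above are easier to compare as shares of the 1,000 rows. A minimal sketch, with the three non-zero counts copied from the output rather than recomputed from the file:

```python
import pandas as pd

# Non-zero missing-value counts copied from the isnull().sum() output.
missing = pd.Series({"Certificate": 101, "Meta_score": 157, "Gross": 169})
n_rows = 1000  # row count reported by info()

# Share of rows with a missing value, as a percentage.
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

On a loaded DataFrame, `df.isnull().mean() * 100` gives the same percentages directly.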

### 4. Column types (IMDB Movies)

**Primary key / identifier**
- There is no explicit ID column, but the row index can be treated as a unique identifier for each movie.
- `Series_Title` is also almost unique in this dataset and can be used as a human-readable identifier.

**Date / temporal column**
- `Released_Year` (object) – stores the year the movie was released.  
  I will later convert this column to a numeric type (int) so that I can analyze trends over time.
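
One way to do this conversion is sketched below. The sample values are hypothetical: the `object` dtype suggests that at least one entry may not parse as a year, so the sketch includes one such entry (`"PG"`) to show how it is handled. Whether the real column contains anything like it is an assumption.

```python
import pandas as pd

# Hypothetical sample: four valid years plus one non-numeric entry.
years = pd.Series(["1994", "1972", "2008", "1974", "PG"], name="Released_Year")

# errors="coerce" turns unparseable strings into NaN instead of raising.
year_num = pd.to_numeric(years, errors="coerce")

# The nullable Int64 dtype keeps integer years while allowing missing values.
year_int = year_num.astype("Int64")
print(year_int)
```

Coercing rather than raising means bad entries surface as missing values that can be counted and inspected before deciding whether to drop or repair them.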

**Categorical / text columns**
- `Poster_Link` – URL of the movie poster.
- `Series_Title` – movie title.
- `Certificate` – age rating (e.g., A, UA, U).
- `Genre` – movie genre(s).
- `Overview` – short plot summary.
- `Director` – director name.
- `Star1`, `Star2`, `Star3`, `Star4` – main cast.

**Numerical columns**
- `IMDB_Rating` (float64) – IMDb user rating.
- `Meta_score` (float64) – Metascore from critics (some missing values).
- `No_of_Votes` (int64) – number of user votes.
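- `Gross` (object) – movie gross revenue; stored as text now, but conceptually a numeric column.  
  I will need to clean and convert it to a numeric type for analysis.
- `Runtime` (object) – running time stored as text such as `142 min`; also conceptually numeric and in need of conversion.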
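
A minimal sketch of this cleaning, which also covers `Runtime` (shown by `head()` as text such as `142 min`). The raw `Gross` strings here are hypothetical: the thousands separators are an assumption about why pandas read the column as `object`, not something visible in the output above.

```python
import pandas as pd

# Hypothetical raw values; the comma-separated Gross format is an assumption.
raw = pd.DataFrame({
    "Gross": ["28,341,469", "134,966,411", None],
    "Runtime": ["142 min", "175 min", "152 min"],
})

# Remove thousands separators, then convert; missing entries stay NaN.
gross_num = pd.to_numeric(
    raw["Gross"].str.replace(",", "", regex=False), errors="coerce"
)

# Pull the leading digits out of strings such as "142 min".
runtime_min = pd.to_numeric(raw["Runtime"].str.extract(r"(\d+)", expand=False))

print(gross_num)
print(runtime_min)
```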

### 5. Potential analysis questions (IMDB Movies)
- Do movies with higher IMDb ratings tend to earn higher box-office gross?
- Which genres are associated with higher IMDb ratings?
- How do average IMDb ratings differ across release years or decades?
- Are critic scores (Meta_score) aligned with IMDb audience ratings?
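
The genre question needs one preparation step, because `Genre` packs several genres into a single comma-separated string. A sketch of that step, using three rows from the `head()` output as a stand-in for the full dataset:

```python
import pandas as pd

# Stand-in sample: titles, genres, and ratings from the head() output.
movies = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "Genre": ["Drama", "Crime, Drama", "Action, Crime, Drama"],
    "IMDB_Rating": [9.3, 9.2, 9.0],
})

# Split the string into a list, then give each genre its own row.
by_genre = movies.assign(Genre=movies["Genre"].str.split(", ")).explode("Genre")

# Mean rating and film count per genre.
summary = by_genre.groupby("Genre")["IMDB_Rating"].agg(["mean", "count"])
print(summary)
```

After `explode`, a film with three genres contributes to three groups, so the per-genre counts sum to more than the number of films.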

### 6. Strengths and weaknesses of the IMDB Movies dataset

**Strengths**
- Contains rich numerical and categorical variables suitable for EDA.
- Includes audience ratings, critic scores, votes, and genres for diverse analysis.
- Easy-to-understand domain with intuitive interpretation.

**Weaknesses**
- Some columns (`Gross`, `Meta_score`, `Certificate`) contain missing values.
- `Gross`, `Released_Year`, and `Runtime` are stored as text and need cleaning and type conversion.
- Dataset includes only “Top 1000” movies, leading to selection bias.

## Dataset 2: AI_Impact_on_Jobs_2030

### 1. Load Dataset

In [7]:
df2 = pd.read_csv('datasets/AI_Impact_on_Jobs_2030.csv')

### 2. Structure & Overview

In [8]:
df2.head()

| | Job_Title | Average_Salary | Years_Experience | Education_Level | AI_Exposure_Index | Tech_Growth_Factor | Automation_Probability_2030 | Risk_Category | Skill_1 | Skill_2 | Skill_3 | Skill_4 | Skill_5 | Skill_6 | Skill_7 | Skill_8 | Skill_9 | Skill_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Security Guard | 45795 | 28 | Master's | 0.18 | 1.28 | 0.85 | High | 0.45 | 0.1 | 0.46 | 0.33 | 0.14 | 0.65 | 0.06 | 0.72 | 0.94 | 0.0 |
| 1 | Research Scientist | 133355 | 20 | PhD | 0.62 | 1.11 | 0.05 | Low | 0.02 | 0.52 | 0.4 | 0.05 | 0.97 | 0.23 | 0.09 | 0.62 | 0.38 | 0.98 |
| 2 | Construction Worker | 146216 | 2 | High School | 0.86 | 1.18 | 0.81 | High | 0.01 | 0.94 | 0.56 | 0.39 | 0.02 | 0.23 | 0.24 | 0.68 | 0.61 | 0.83 |
| 3 | Software Engineer | 136530 | 13 | PhD | 0.39 | 0.68 | 0.6 | Medium | 0.43 | 0.21 | 0.57 | 0.03 | 0.84 | 0.45 | 0.4 | 0.93 | 0.73 | 0.33 |
| 4 | Financial Analyst | 70397 | 22 | High School | 0.52 | 1.46 | 0.64 | Medium | 0.75 | 0.54 | 0.59 | 0.97 | 0.61 | 0.28 | 0.3 | 0.17 | 0.02 | 0.42 |


In [9]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Job_Title                    3000 non-null   object 
 1   Average_Salary               3000 non-null   int64  
 2   Years_Experience             3000 non-null   int64  
 3   Education_Level              3000 non-null   object 
 4   AI_Exposure_Index            3000 non-null   float64
 5   Tech_Growth_Factor           3000 non-null   float64
 6   Automation_Probability_2030  3000 non-null   float64
 7   Risk_Category                3000 non-null   object 
 8   Skill_1                      3000 non-null   float64
 9   Skill_2                      3000 non-null   float64
 10  Skill_3                      3000 non-null   float64
 11  Skill_4                      3000 non-null   float64
 12  Skill_5                      3000 non-null   float64
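 13  Skill_6                      3000 non-null   float64
 14  Skill_7                      3000 non-null   float64
 15  Skill_8                      3000 non-null   float64
 16  Skill_9                      3000 non-null   float64
 17  Skill_10                     3000 non-null   float64
dtypes: float64(13), int64(2), object(3)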

In [10]:
df2.describe()

| | Average_Salary | Years_Experience | AI_Exposure_Index | Tech_Growth_Factor | Automation_Probability_2030 | Skill_1 | Skill_2 | Skill_3 | Skill_4 | Skill_5 | Skill_6 | Skill_7 | Skill_8 | Skill_9 | Skill_10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 |
| mean | 89372.279 | 14.677667 | 0.501283 | 0.995343 | 0.501503 | 0.496973 | 0.497233 | 0.499313 | 0.503667 | 0.49027 | 0.499807 | 0.49916 | 0.502843 | 0.501433 | 0.493627 |
| std | 34608.088767 | 8.739788 | 0.284004 | 0.287669 | 0.247881 | 0.287888 | 0.288085 | 0.288354 | 0.287063 | 0.285818 | 0.28605 | 0.288044 | 0.289832 | 0.285818 | 0.286464 |
| min | 30030.0 | 0.0 | 0.0 | 0.5 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 58640.0 | 7.0 | 0.26 | 0.74 | 0.31 | 0.24 | 0.25 | 0.25 | 0.26 | 0.24 | 0.26 | 0.25 | 0.25 | 0.26 | 0.25 |
| 50% | 89318.0 | 15.0 | 0.5 | 1.0 | 0.5 | 0.505 | 0.5 | 0.5 | 0.51 | 0.49 | 0.5 | 0.49 | 0.5 | 0.5 | 0.49 |
| 75% | 119086.5 | 22.0 | 0.74 | 1.24 | 0.7 | 0.74 | 0.74 | 0.75 | 0.75 | 0.73 | 0.74 | 0.75 | 0.75 | 0.74 | 0.74 |
| max | 149798.0 | 29.0 | 1.0 | 1.5 | 0.95 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### 3. Missing Values

In [11]:
df2.isnull().sum()

Job_Title                      0
Average_Salary                 0
Years_Experience               0
Education_Level                0
AI_Exposure_Index              0
Tech_Growth_Factor             0
Automation_Probability_2030    0
Risk_Category                  0
Skill_1                        0
Skill_2                        0
Skill_3                        0
Skill_4                        0
Skill_5                        0
Skill_6                        0
Skill_7                        0
Skill_8                        0
Skill_9                        0
Skill_10                       0
dtype: int64

### 4. Column types (AI Impact on Jobs 2030)

**Primary Key / identifier**
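- There is no explicit ID column; the row index uniquely identifies each row.
- `Job_Title` should not be assumed unique: with 3,000 rows and generic titles such as Security Guard or Software Engineer, titles likely repeat, so uniqueness needs to be checked before using it as an identifier.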
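
Whether `Job_Title` can serve as an identifier is straightforward to verify. The sketch below uses a tiny stand-in frame in which one title repeats on purpose (a hypothetical duplicate); the helper name `is_candidate_key` is also hypothetical.

```python
import pandas as pd

# Stand-in sample: "Security Guard" repeats on purpose (hypothetical duplicate).
jobs = pd.DataFrame({
    "Job_Title": [
        "Security Guard", "Research Scientist", "Security Guard", "Software Engineer",
    ],
})

def is_candidate_key(df: pd.DataFrame, column: str) -> bool:
    """True only if the column has no missing values and no repeated values."""
    return bool(df[column].notna().all() and df[column].is_unique)

print(jobs["Job_Title"].nunique(), "distinct titles in", len(jobs), "rows")
print(is_candidate_key(jobs, "Job_Title"))
```

If the number of distinct titles is far below the row count, the column is better treated as a grouping variable than as a key.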

**Categorical columns**
- `Job_Title` – Name of the occupation.
- `Education_Level` – Required education level (e.g., High School, Master's, PhD).
- `Risk_Category` – Categorical risk level (e.g., Low, Medium, High).

**Numerical columns**
- `Average_Salary` (int64) – Average annual salary for the job.
- `Years_Experience` (int64) – Required years of experience.
- `AI_Exposure_Index` (float64) – Degree to which the job is affected by AI.
- `Tech_Growth_Factor` (float64) – Projected industry technology growth rate.
- `Automation_Probability_2030` (float64) – Probability of automation by 2030.
- `Skill_1` to `Skill_10` (float64) – Numerical indicators for different skill strengths.
  (These behave like feature vectors and allow comparison of jobs by skill composition.)

**Date/temporal columns**
- There are no explicit date fields in this dataset.
  The only time reference is the 2030 horizon built into `Automation_Probability_2030`, which is a single projection rather than a time series,
  so it supports comparisons across jobs but not trends over time.


### 5. Potential analysis questions (AI Impact on Jobs 2030)
- Which job categories show the highest automation risk by 2030?
- Is higher salary associated with lower or higher automation probability?
- How does education level relate to AI exposure?
- How do different skill profiles correlate with automation risk?
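
The education and risk questions reduce to a grouped mean and a contingency table. A minimal sketch on the five rows from `head()`, used here only as a stand-in for the 3,000-row file:

```python
import pandas as pd

# Stand-in sample: selected columns from the head() output.
jobs = pd.DataFrame({
    "Education_Level": ["Master's", "PhD", "High School", "PhD", "High School"],
    "Risk_Category": ["High", "Low", "High", "Medium", "Medium"],
    "AI_Exposure_Index": [0.18, 0.62, 0.86, 0.39, 0.52],
    "Automation_Probability_2030": [0.85, 0.05, 0.81, 0.60, 0.64],
})

# Mean exposure and automation probability per education level.
by_edu = jobs.groupby("Education_Level")[
    ["AI_Exposure_Index", "Automation_Probability_2030"]
].mean()

# Counts of each risk category within each education level.
risk_table = pd.crosstab(jobs["Education_Level"], jobs["Risk_Category"])

print(by_edu)
print(risk_table)
```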


### 6. Strengths and weaknesses of the AI Jobs dataset

**Strengths**
- Large dataset (3000 rows) with diverse numerical fields for correlation analysis.
- No missing values, making preprocessing easier.
- Topic is modern and intuitive, suitable for generating meaningful questions.

**Weaknesses**
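- Numerical distributions look synthetic: the `Skill_*` and `AI_Exposure_Index` columns have means near 0.5 and standard deviations near 0.29, which is what uniform random draws on [0, 1] would produce (1/√12 ≈ 0.289).
- Some variable combinations appear unrealistic; for example, the first rows show a Security Guard with a Master's degree and a Construction Worker with 2 years of experience earning 146,216.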
- Skill_1–Skill_10 lack clear definitions, limiting interpretability.
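
The suspicion of synthetic generation can be checked against a simple benchmark. A continuous uniform distribution on [a, b] has standard deviation (b − a)/√12, so the sketch below compares that value with the standard deviations reported by `describe()`. Treating the observed minimum and maximum as the bounds of the distribution is an assumption, and a close match is consistent with uniform generation without proving it.

```python
import math

def uniform_std(low: float, high: float) -> float:
    """Standard deviation of a continuous uniform distribution on [low, high]."""
    return (high - low) / math.sqrt(12)

# (min, max, reported std) copied from the describe() output.
reported = {
    "Average_Salary": (30030.0, 149798.0, 34608.088767),
    "Tech_Growth_Factor": (0.5, 1.5, 0.287669),
    "Skill_1": (0.0, 1.0, 0.287888),
}

for name, (low, high, std) in reported.items():
    expected = uniform_std(low, high)
    print(f"{name}: reported={std:.4f}, uniform={expected:.4f}, ratio={std / expected:.3f}")
```

For these three columns the reported values fall within about 1% of the uniform benchmark, which real salary data, typically right-skewed, would be unlikely to do.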

## Dataset 3: car_price_prediction

### 1. Load Dataset

In [12]:
df2 = pd.read_csv('datasets/car_price_prediction_.csv')  # note: reuses the name df2, replacing the AI Jobs DataFrame

### 2. Structure & Overview

In [13]:
df2.head()

| | Car ID | Brand | Year | Engine Size | Fuel Type | Transmission | Mileage | Condition | Price | Model |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Tesla | 2016 | 2.3 | Petrol | Manual | 114832 | New | 26613.92 | Model X |
| 1 | 2 | BMW | 2018 | 4.4 | Electric | Manual | 143190 | Used | 14679.61 | 5 Series |
| 2 | 3 | Audi | 2013 | 4.5 | Electric | Manual | 181601 | New | 44402.61 | A4 |
| 3 | 4 | Tesla | 2011 | 4.1 | Diesel | Automatic | 68682 | New | 86374.33 | Model Y |
| 4 | 5 | Ford | 2009 | 2.6 | Diesel | Manual | 223009 | Like New | 73577.1 | Mustang |


In [14]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Car ID        2500 non-null   int64  
 1   Brand         2500 non-null   object 
 2   Year          2500 non-null   int64  
 3   Engine Size   2500 non-null   float64
 4   Fuel Type     2500 non-null   object 
 5   Transmission  2500 non-null   object 
 6   Mileage       2500 non-null   int64  
 7   Condition     2500 non-null   object 
 8   Price         2500 non-null   float64
 9   Model         2500 non-null   object 
dtypes: float64(2), int64(3), object(5)
memory usage: 195.4+ KB


In [15]:
df2.describe()

| | Car ID | Year | Engine Size | Mileage | Price |
|---|---|---|---|---|---|
| count | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 |
| mean | 1250.5 | 2011.6268 | 3.46524 | 149749.8448 | 52638.022532 |
| std | 721.83216 | 6.9917 | 1.432053 | 87919.952034 | 27295.833455 |
| min | 1.0 | 2000.0 | 1.0 | 15.0 | 5011.27 |
| 25% | 625.75 | 2005.0 | 2.2 | 71831.5 | 28908.485 |
| 50% | 1250.5 | 2012.0 | 3.4 | 149085.0 | 53485.24 |
| 75% | 1875.25 | 2018.0 | 4.7 | 225990.5 | 75838.5325 |
| max | 2500.0 | 2023.0 | 6.0 | 299967.0 | 99982.59 |


### 3. Missing Values

In [16]:
df2.isnull().sum()

Car ID          0
Brand           0
Year            0
Engine Size     0
Fuel Type       0
Transmission    0
Mileage         0
Condition       0
Price           0
Model           0
dtype: int64

### 4. Column Types (Car Dataset)

**Primary key / identifier**
- `Car ID` – unique numeric identifier for each vehicle. It can be treated as the primary key.

**Date / temporal column**
- `Year` (int64) – the manufacturing year of the car.  
  This is a temporal numeric variable and allows analysis of price trends and depreciation over time.
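
For depreciation analysis, `Year` is usually more convenient as an age. A minimal sketch on the `head()` rows, assuming ages are measured against 2023 (the maximum `Year` in `describe()`); the bucket edges are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Stand-in sample: Year and Price from the head() output.
cars = pd.DataFrame({
    "Year": [2016, 2018, 2013, 2011, 2009],
    "Price": [26613.92, 14679.61, 44402.61, 86374.33, 73577.10],
})

# Assumption: ages are measured against 2023, the latest Year in describe().
REFERENCE_YEAR = 2023
cars["Age"] = REFERENCE_YEAR - cars["Year"]

# Illustrative age buckets: 0-5, 6-10, and 11-25 years.
cars["Age_Group"] = pd.cut(cars["Age"], bins=[0, 5, 10, 25], include_lowest=True)

# Mean price per bucket; observed=True skips buckets with no cars.
print(cars.groupby("Age_Group", observed=True)["Price"].mean())
```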

**Categorical / text columns**
- `Brand` – manufacturer of the car (e.g., Tesla, BMW, Audi, Ford).
- `Fuel Type` – fuel category (Petrol, Diesel, Electric, Hybrid).
- `Transmission` – gearbox type (Automatic / Manual).
- `Condition` – overall state of the vehicle (e.g., New, Used, Like New).
- `Model` – the specific model name for each car.

**Numerical columns**
- `Engine Size` (float64) – engine displacement (e.g., 2.0L).
- `Mileage` (int64) – total distance driven.
- `Price` (float64) – listed car price (assumed to be in USD; the data does not state a currency).

**Notes on data cleaning**
- No missing values are present.
- All numerical fields are already in the correct dtype.
- Potential cleaning work: outlier inspection (Mileage and Price), normalization if needed.
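
A first pass at the outlier inspection can be done from the `describe()` output alone, using the common 1.5 × IQR rule (Tukey fences). This rule is one conventional choice among several, and the helper name `iqr_fences` is hypothetical.

```python
from typing import Tuple

def iqr_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper Tukey fences; values outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# (Q1, Q3, min, max) copied from the describe() output.
stats = {
    "Mileage": (71831.5, 225990.5, 15.0, 299967.0),
    "Price": (28908.485, 75838.5325, 5011.27, 99982.59),
}

for name, (q1, q3, lo, hi) in stats.items():
    low_fence, high_fence = iqr_fences(q1, q3)
    # If both extremes are inside the fences, no row can be flagged.
    has_outliers = lo < low_fence or hi > high_fence
    print(f"{name}: fences=({low_fence:.1f}, {high_fence:.1f}), outliers={has_outliers}")
```

With the reported quartiles, the minimum and maximum of both columns fall inside the fences, so this rule flags no outliers in `Mileage` or `Price`.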


### 5. Potential Analysis Questions

- How does mileage affect car price?  
- Which brands tend to retain value better than others?  
- Is engine size associated with car price or mileage?  
- How does the manufacturing year influence the purchasing price?
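
The brand and condition questions can be approached with a grouped summary and a pivot table. A minimal sketch on the five `head()` rows, which stand in for the full 2,500 listings; with so few rows the numbers themselves mean nothing.

```python
import pandas as pd

# Stand-in sample: Brand, Condition, and Price from the head() output.
cars = pd.DataFrame({
    "Brand": ["Tesla", "BMW", "Audi", "Tesla", "Ford"],
    "Condition": ["New", "Used", "New", "New", "Like New"],
    "Price": [26613.92, 14679.61, 44402.61, 86374.33, 73577.10],
})

# Median price and listing count per brand, highest median first.
brand_summary = (
    cars.groupby("Brand")["Price"]
    .agg(median_price="median", listings="count")
    .sort_values("median_price", ascending=False)
)

# Brand-by-condition table of mean prices; empty combinations become NaN.
price_table = cars.pivot_table(
    index="Brand", columns="Condition", values="Price", aggfunc="mean"
)

print(brand_summary)
print(price_table)
```

Reporting the listing count next to the median makes it visible when a brand's figure rests on very few cars.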

### 6. Strengths and weaknesses of the Car dataset

**Strengths**
- Includes a balanced mix of numerical and categorical variables, enabling both correlation analysis and group comparisons.
- No missing values, allowing the dataset to be used immediately without heavy preprocessing.
- Real-world context (car pricing and depreciation) makes findings intuitive and easy to interpret.

**Weaknesses**
- Price and mileage span wide ranges (Price from about 5,011 to 99,983; Mileage from 15 to 299,967), so both should be inspected for outliers that could distort the analysis.
- Certain brands or models may appear far more frequently than others, creating potential imbalance in comparisons.
- Broad variation in vehicle condition and age could mix very different car categories, requiring segmentation for reliable insights.
