## **Installing Necessary Libraries**

In [89]:
%pip install pandas numpy scikit-learn matplotlib seaborn # Simply run this code to install necessary libraries

## **Importing Necessary Libraries**

In [90]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## **Loading the NBA & Bank Datasets**

In [91]:
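NBA = 'Data/all_seasons.csv'
BANK = 'Data/bank-full.csv'

# The bank file is semicolon-delimited, so the separator is set explicitly
bank_df = pd.read_csv(BANK, sep=';')

# The NBA file is read with the default comma separator
nba_df = pd.read_csv(NBA)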
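
The `sep=';'` argument matters because `pd.read_csv` splits on commas by default. The sketch below illustrates the difference on a small in-memory sample; the sample text is hypothetical and only mimics the layout of the bank file.

```python
import io
import pandas as pd

# Hypothetical sample that mimics a semicolon-delimited file
sample_text = "age;job;balance\n58;management;2143\n44;technician;29\n"

# Default separator (comma): every line is read as one single column
wrong = pd.read_csv(io.StringIO(sample_text))

# Explicit separator: the three fields are split as intended
right = pd.read_csv(io.StringIO(sample_text), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A one-column result after loading is usually the first sign that the separator is wrong.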

# **Statistics overview of Bank Data**

## **DataSet Head**

In [92]:
bank_df.head(5)

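| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |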


## **Shape of the Dataset** 

In [93]:
# Shape of the DataFrame: returns a tuple (rows, columns)
bank_shape = bank_df.shape
# Print the number of rows and columns
print("Shape of dataset:", bank_shape[0], "x", bank_shape[1])


Shape of dataset: 45211 x 17



**The dataset contains 45,211 rows and 17 columns: 45,211 records of individuals, each with 16 input features plus the outcome column `y`.**

## **Columns in the Dataset**

In [94]:
print("Columns in the dataset:\n", bank_df.columns)  # Prints column names


Columns in the dataset:
 Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')


### **The dataset includes the following columns:**
 
`age`: Age of the individual (integer).

`job`: Type of job the individual has (categorical).

`marital`: Marital status (categorical).

`education`: Level of education (categorical).

`default`: Whether the individual has credit in default (categorical).

`balance`: Account balance (integer).

`housing`: Whether the individual has a housing loan (categorical).

`loan`: Whether the individual has a personal loan (categorical).

`contact`: Type of contact used for communication (categorical).

`day`: Last contact day of the month (integer).

`month`: Last contact month (categorical).

`duration`: Duration of the last contact, in seconds (integer).

`campaign`: Number of contacts performed during the campaign (integer).

`pdays`: Number of days since the individual was last contacted (integer); -1 indicates no prior contact.

`previous`: Number of contacts before the current campaign (integer).

`poutcome`: Outcome of the previous marketing campaign (categorical).

`y`: Whether the individual subscribed to a term deposit (binary outcome: 'yes' or 'no').

## **Data Types of Each Column**

In [95]:
print("Data Types of each column:\n", bank_df.dtypes)  # Prints the data types of each column


Data Types of each column:
 age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object


### **The dataset contains a mix of numerical and categorical columns:**
1. **Numerical columns**: `age`, `balance`, `day`, `duration`, `campaign`, `pdays`, and `previous`.

2. **Categorical columns**: `job`, `marital`, `education`, `default`, `housing`, `loan`, `contact`, `month`, `poutcome`, and `y`.
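
The same split can be derived from the dtypes rather than typed by hand. The following is a minimal sketch using `select_dtypes` on a small stand-in DataFrame (the `sample` frame is illustrative, built from a few values in the head output, and is not the full dataset).

```python
import pandas as pd

# Small stand-in for bank_df with two numeric and two text columns
sample = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "balance": [2143, 29, 2],
    "y": ["no", "no", "no"],
})

# Columns with a numeric dtype (int64, float64, ...)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['age', 'balance']
print(categorical_cols)  # ['job', 'y']
```

Applied to `bank_df`, the same two calls should reproduce the lists above, assuming no numeric column was read as text.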

## **Missing Values**

In [96]:
print("Missing values in each column:\n", bank_df.isnull().sum())  # Prints count of missing values per column

Missing values in each column:
 age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64


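**There are no missing values in any of the columns: each column is populated for all 45,211 records. Note, however, that the head output shows the string `unknown` in several categorical columns (for example `job`, `education`, `contact`, and `poutcome`); `isnull()` does not count these placeholders as missing.**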
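
Because the head output shows `unknown` entries in columns such as `job`, `education`, and `contact`, it may be worth counting those placeholders separately from true nulls. The sketch below does this on three rows taken from the head output (rows 0, 3, and 4); treating `unknown` as a missing-value marker is an assumption about how the data was encoded.

```python
import pandas as pd

# Rows 0, 3 and 4 of the head output, restricted to three columns
sample = pd.DataFrame({
    "job": ["management", "blue-collar", "unknown"],
    "education": ["tertiary", "unknown", "unknown"],
    "contact": ["unknown", "unknown", "unknown"],
})

# True nulls (NaN/None): none in this sample
null_counts = sample.isnull().sum()

# Placeholder strings that isnull() does not detect
unknown_counts = (sample == "unknown").sum()

print(null_counts.sum())         # 0
print(unknown_counts.to_dict())  # {'job': 1, 'education': 2, 'contact': 3}
```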

## **Summary Statistics**

In [97]:
print("Description:")
bank_df.describe()

Description:


| | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| count | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 |
| mean | 40.93621 | 1362.272058 | 15.806419 | 258.16308 | 2.763841 | 40.197828 | 0.580323 |
| std | 10.618762 | 3044.765829 | 8.322476 | 257.527812 | 3.098021 | 100.128746 | 2.303441 |
| min | 18.0 | -8019.0 | 1.0 | 0.0 | 1.0 | -1.0 | 0.0 |
| 25% | 33.0 | 72.0 | 8.0 | 103.0 | 1.0 | -1.0 | 0.0 |
| 50% | 39.0 | 448.0 | 16.0 | 180.0 | 2.0 | -1.0 | 0.0 |
| 75% | 48.0 | 1428.0 | 21.0 | 319.0 | 3.0 | -1.0 | 0.0 |
| max | 95.0 | 102127.0 | 31.0 | 4918.0 | 63.0 | 871.0 | 275.0 |


**`age`**: The mean age is approximately 40.94 years (median 39), with a minimum of 18 and a maximum of 95.

**`balance`**: The mean balance is about 1,362, well above the median of 448, which points to a right-skewed distribution. Values range from -8,019 (a negative balance) to 102,127.

**`day`**: The mean last-contact day is about 15.81 and the median is 16, close to the middle of the month. This is consistent with contacts being spread across the month, although the mean alone does not show that they are evenly distributed.

**`duration`**: The mean duration of the last contact is 258.16 seconds (median 180), with a minimum of 0 and a maximum of 4,918 seconds, showing large variation in the length of contact.

**`campaign`**: The mean number of contacts during the campaign is 2.76 (median 2), with values ranging from 1 to 63.

**`pdays`**: Values range from -1 (indicating no prior contact) to 871. The 75th percentile is still -1, so at least three quarters of individuals had no prior contact, and the mean of 40.20 mixes this placeholder with real day counts.

**`previous`**: The number of prior contacts ranges from 0 to 275, with a mean of 0.58. The 75th percentile is 0, so at least three quarters of individuals had no prior contact.
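
Two of the observations above can be checked directly: comparing the mean with the median as a rough skew indicator, and measuring the share of rows where `pdays` equals -1. The sketch below uses only the five rows shown in the head output, so its numbers illustrate the method and are not the full-dataset results.

```python
import pandas as pd

# balance and pdays values from the five rows of the head output
sample = pd.DataFrame({
    "balance": [2143, 29, 2, 1506, 1],
    "pdays": [-1, -1, -1, -1, -1],
})

# A mean well above the median suggests a right-skewed column
mean_balance = sample["balance"].mean()
median_balance = sample["balance"].median()

# Fraction of rows carrying the "no prior contact" placeholder
share_no_prior = (sample["pdays"] == -1).mean()

print(mean_balance, median_balance)  # 736.2 29.0
print(share_no_prior)                # 1.0
```

On the full data, the same expressions would be applied to `bank_df["balance"]` and `bank_df["pdays"]`.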

# **Statistics overview of NBA Data**

## **DataSet Head**

In [98]:
nba_df.head()

| | Unnamed: 0 | player_name | team_abbreviation | age | player_height | player_weight | college | country | draft_year | draft_round | ... | pts | reb | ast | net_rating | oreb_pct | dreb_pct | usg_pct | ts_pct | ast_pct | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Randy Livingston | HOU | 22.0 | 193.04 | 94.800728 | Louisiana State | USA | 1996 | 2 | ... | 3.9 | 1.5 | 2.4 | 0.3 | 0.042 | 0.071 | 0.169 | 0.487 | 0.248 | 1996-97 |
| 1 | 1 | Gaylon Nickerson | WAS | 28.0 | 190.5 | 86.18248 | Northwestern Oklahoma | USA | 1994 | 2 | ... | 3.8 | 1.3 | 0.3 | 8.9 | 0.03 | 0.111 | 0.174 | 0.497 | 0.043 | 1996-97 |
| 2 | 2 | George Lynch | VAN | 26.0 | 203.2 | 103.418976 | North Carolina | USA | 1993 | 1 | ... | 8.3 | 6.4 | 1.9 | -8.2 | 0.106 | 0.185 | 0.175 | 0.512 | 0.125 | 1996-97 |
| 3 | 3 | George McCloud | LAL | 30.0 | 203.2 | 102.0582 | Florida State | USA | 1989 | 1 | ... | 10.2 | 2.8 | 1.7 | -2.7 | 0.027 | 0.111 | 0.206 | 0.527 | 0.125 | 1996-97 |
| 4 | 4 | George Zidek | DEN | 23.0 | 213.36 | 119.748288 | UCLA | USA | 1995 | 1 | ... | 2.8 | 1.7 | 0.3 | -14.1 | 0.102 | 0.169 | 0.195 | 0.5 | 0.064 | 1996-97 |


## **Shape of the Dataset** 

In [99]:
# Shape of the DataFrame: returns a tuple (rows, columns)
nba_shape = nba_df.shape
# Print the information
print("Shape of NBA dataset:", nba_shape[0], "x", nba_shape[1])  # Prints the number of rows and columns

Shape of NBA dataset: 12844 x 22


**The dataset contains 12,844 rows and 22 columns describing players' attributes and performance; judging by the `season` column, each row appears to be one player's record for one season.**

## **Columns in the Dataset**

In [100]:
print("Columns in the NBA dataset:\n", nba_df.columns)  # Prints column names

Columns in the NBA dataset:
 Index(['Unnamed: 0', 'player_name', 'team_abbreviation', 'age',
       'player_height', 'player_weight', 'college', 'country', 'draft_year',
       'draft_round', 'draft_number', 'gp', 'pts', 'reb', 'ast', 'net_rating',
       'oreb_pct', 'dreb_pct', 'usg_pct', 'ts_pct', 'ast_pct', 'season'],
      dtype='object')


### **The dataset includes columns such as:**

- **Player attributes**: `player_name`, `team_abbreviation`, `age`, `player_height`, `player_weight`.
  
- **College and country information**: `college`, `country`.
  
- **Draft details**: `draft_year`, `draft_round`, `draft_number`.
  
- **Performance metrics**: `gp` (games played), `pts` (points), `reb` (rebounds), `ast` (assists), `net_rating`, `oreb_pct` (offensive rebound percentage), `dreb_pct` (defensive rebound percentage), `usg_pct` (usage percentage), `ts_pct` (true shooting percentage), `ast_pct` (assist percentage).
  
- **Additional columns**: `Unnamed: 0` (index), `season` (season year).
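
If `Unnamed: 0` is only a saved row index, as the description above suggests, it carries no information beyond the DataFrame's own index and could be dropped or used as the index at load time. The sketch below shows both options on a hypothetical in-memory CSV; the player names and values in it are placeholders.

```python
import io
import pandas as pd

# Hypothetical CSV written with its index: the first header field is empty
csv_text = ",player_name,pts\n0,Player A,3.9\n1,Player B,3.8\n"

# pandas names the header-less first column "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: drop the redundant column after loading
cleaned = raw.drop(columns=["Unnamed: 0"])

# Option 2: use the first column as the index while loading
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))      # ['Unnamed: 0', 'player_name', 'pts']
print(list(cleaned.columns))  # ['player_name', 'pts']
print(list(indexed.columns))  # ['player_name', 'pts']
```

Either option would reduce `nba_df` from 22 to 21 columns, so the shape and column listings in this notebook would change accordingly.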

## **Data Types of Each Column**

In [101]:
print("Data Types of each column:\n", nba_df.dtypes)  # Prints the data types of each column

Data Types of each column:
 Unnamed: 0             int64
player_name           object
team_abbreviation     object
age                  float64
player_height        float64
player_weight        float64
college               object
country               object
draft_year            object
draft_round           object
draft_number          object
gp                     int64
pts                  float64
reb                  float64
ast                  float64
net_rating           float64
oreb_pct             float64
dreb_pct             float64
usg_pct              float64
ts_pct               float64
ast_pct              float64
season                object
dtype: object


### **The dataset contains a mix of numerical and categorical columns:**

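- Numeric (int64/float64): `Unnamed: 0`, `age`, `player_height`, `player_weight`, `gp`, `pts`, `reb`, `ast`, `net_rating`, `oreb_pct`, `dreb_pct`, `usg_pct`, `ts_pct`, `ast_pct`.
  
- Categorical (object): `player_name`, `team_abbreviation`, `college`, `country`, `draft_year`, `draft_round`, `draft_number`, `season`. The three draft columns look numeric in the head output but are stored as `object`, which suggests that some rows hold non-numeric entries.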
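
If the draft columns are needed as numbers, one possible approach is `pd.to_numeric` with `errors="coerce"`, which turns any non-numeric entry into `NaN`. This is a sketch only: the `"Undrafted"` placeholder below is an assumption for illustration, and the actual non-numeric values in `nba_df` should be inspected first (for example with `nba_df["draft_year"].unique()`).

```python
import pandas as pd

# Illustrative frame; "Undrafted" is a hypothetical non-numeric entry
draft = pd.DataFrame({
    "draft_year": ["1996", "1994", "Undrafted"],
    "draft_round": ["2", "2", "Undrafted"],
})

# Convert each column to numbers; unparseable entries become NaN
draft_numeric = draft.apply(pd.to_numeric, errors="coerce")

print(draft_numeric["draft_year"].isna().sum())  # 1
print(draft_numeric["draft_round"].isna().sum())  # 1
```

Coercion discards the original text, so a separate flag column may be preferable if the non-numeric entries are themselves informative.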

## **Missing Values**

In [102]:
print("Missing values in each column:\n", nba_df.isnull().sum())  # Prints count of missing values per column

Missing values in each column:
 Unnamed: 0              0
player_name             0
team_abbreviation       0
age                     0
player_height           0
player_weight           0
college              1854
country                 0
draft_year              0
draft_round             0
draft_number            0
gp                      0
pts                     0
reb                     0
ast                     0
net_rating              0
oreb_pct                0
dreb_pct                0
usg_pct                 0
ts_pct                  0
ast_pct                 0
season                  0
dtype: int64


**The `college` column has 1,854 missing values.** This column is not relevant to our prediction, so these gaps do not affect the analysis.

**All other columns have no missing values.**
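
Two simple ways to deal with the missing `college` values are to drop the column, since it is not used for prediction, or to fill the gaps with an explicit label if the column is kept for exploration. The sketch below shows both on a hypothetical three-row frame; the `"Unknown"` fill label is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing college entry
players = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "college": ["Louisiana State", np.nan, "UCLA"],
    "pts": [3.9, 3.8, 2.8],
})

# Option 1: drop the column entirely
features = players.drop(columns=["college"])

# Option 2: keep the column and label the gaps explicitly
labelled = players.assign(college=players["college"].fillna("Unknown"))

print(list(features.columns))       # ['player_name', 'pts']
print(labelled["college"].tolist())  # ['Louisiana State', 'Unknown', 'UCLA']
```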

## **Summary Statistics**

In [103]:
print("Description:")
nba_df.describe()

Description:


| | Unnamed: 0 | age | player_height | player_weight | gp | pts | reb | ast | net_rating | oreb_pct | dreb_pct | usg_pct | ts_pct | ast_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 | 12844.0 |
| mean | 6421.5 | 27.045313 | 200.555097 | 100.263279 | 51.154158 | 8.212582 | 3.558486 | 1.824681 | -2.226339 | 0.054073 | 0.140646 | 0.184641 | 0.513138 | 0.131595 |
| std | 3707.887763 | 4.339211 | 9.11109 | 12.426628 | 25.084904 | 6.016573 | 2.477885 | 1.80084 | 12.665124 | 0.043335 | 0.062513 | 0.053545 | 0.101724 | 0.094172 |
| min | 0.0 | 18.0 | 160.02 | 60.327736 | 1.0 | 0.0 | 0.0 | 0.0 | -250.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3210.75 | 24.0 | 193.04 | 90.7184 | 31.0 | 3.6 | 1.8 | 0.6 | -6.4 | 0.021 | 0.096 | 0.149 | 0.482 | 0.066 |
| 50% | 6421.5 | 26.0 | 200.66 | 99.79024 | 57.0 | 6.7 | 3.0 | 1.2 | -1.3 | 0.04 | 0.1305 | 0.181 | 0.525 | 0.103 |
| 75% | 9632.25 | 30.0 | 208.28 | 108.86208 | 73.0 | 11.5 | 4.7 | 2.4 | 3.2 | 0.083 | 0.179 | 0.217 | 0.563 | 0.179 |
| max | 12843.0 | 44.0 | 231.14 | 163.29312 | 85.0 | 36.1 | 16.3 | 11.7 | 300.0 | 1.0 | 1.0 | 1.0 | 1.5 | 1.0 |


## **Descriptive Statistics**

- **Age**:
    - Average: ~27.05 years, Range: 18 to 44 years.
- **Player Height**:
    - Average: ~200.56 cm, Range: 160.02 to 231.14 cm.
- **Player Weight**:
    - Average: ~100.26 kg, Range: 60.33 to 163.29 kg.
- **Games Played (`gp`)**:
    - Average: 51.15 games per season, Range: 1 to 85 games.
- **Points (`pts`)**:
    - Average: 8.21 points per game, Range: 0 to 36.1 points.
- **Net Rating**:
    - Mean: -2.23, Range: -250 to 300. The extremes lie far outside the interquartile range of -6.4 to 3.2; since `gp` can be as low as 1, they may come from players with very few games rather than from real differences in contribution.
- **Rebounds (`reb`)**:
    - Average: 3.56 per game, Range: 0 to 16.3.
- **Assists (`ast`)**:
    - Average: 1.82 per game, Range: 0 to 11.7.
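
One way to examine the wide `net_rating` range is to look only at rows with a minimum number of games played and see whether the extremes remain. The sketch below uses a hypothetical four-row frame and a hypothetical threshold `MIN_GAMES`; whether the extreme values in `nba_df` really belong to low-`gp` rows is an assumption that has to be checked against the data.

```python
import pandas as pd

# Hypothetical rows: two with very few games and extreme ratings
seasons = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C", "Player D"],
    "gp": [1, 64, 82, 2],
    "net_rating": [300.0, 0.3, -8.2, -250.0],
})

MIN_GAMES = 10  # hypothetical cut-off, to be tuned for the analysis

# Keep only rows with enough games for the rating to be meaningful
regulars = seasons[seasons["gp"] >= MIN_GAMES]

print(seasons["net_rating"].min(), seasons["net_rating"].max())    # -250.0 300.0
print(regulars["net_rating"].min(), regulars["net_rating"].max())  # -8.2 0.3
```

Running the same filter on `nba_df` and comparing `describe()` before and after would show how much of the range depends on small samples.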