# Basic model with scikit-learn
> First model with scikit-learn
- toc: true 
- badges: false
- comments: true
- author: Cécile Gallioz
- categories: [pandas, sklearn]

# Imports

In [1]:
import pandas as pd

In [2]:
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/adult-census.csv")

# First analysis

In [3]:
print(f"The dataset contains {myDataFrame.shape[0]} samples and "
      f"{myDataFrame.shape[1]} columns")

The dataset contains 48842 samples and 14 columns


In [4]:
myDataFrame.head()

| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


## Which column is our target to predict?

In [5]:
target_column = 'class'

# Separate the target from the features, reusing the name defined above
target_y = myDataFrame[target_column]
data_X = myDataFrame.drop(columns=target_column)

In [6]:
target_y.value_counts()

 <=50K    37155
 >50K     11687
Name: class, dtype: int64
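
The counts above are easier to interpret as proportions. The sketch below rebuilds the two counts by hand (so it runs without the CSV file) and converts them to shares; the names `counts` and `shares` are illustrative and not part of the original notebook. On the real data, `target_y.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"<=50K": 37155, ">50K": 11687}, name="class")

# Share of each class in the whole dataset.
shares = counts / counts.sum()
print(shares.round(3))
```

Roughly three quarters of the samples belong to the `<=50K` class, which is worth keeping in mind when reading an accuracy score later on.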

In [7]:
data_X.head()

| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |


## Crosstab
A crosstab helps detect two columns that encode the same information in different forms and are therefore redundant. When that is the case, one of the two columns can be dropped. Here each "education" value maps to exactly one "education-num" value, so we drop "education-num".

In [8]:
pd.crosstab(index=data_X['education'],
            columns=data_X['education-num'])

| education / education-num | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10th | 0 | 0 | 0 | 0 | 0 | 1389 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11th | 0 | 0 | 0 | 0 | 0 | 0 | 1812 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12th | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 657 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1st-4th | 0 | 247 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5th-6th | 0 | 0 | 509 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7th-8th | 0 | 0 | 0 | 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9th | 0 | 0 | 0 | 0 | 756 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Assoc-acdm | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1601 | 0 | 0 | 0 | 0 |
| Assoc-voc | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2061 | 0 | 0 | 0 | 0 | 0 |
| Bachelors | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8025 | 0 | 0 | 0 |
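
Reading the crosstab by eye works for two columns, but the same check can be done programmatically. The sketch below is one possible way to do it, on a small hand-made DataFrame rather than the census file; the helper `is_one_to_one` and the toy frames are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy data standing in for the two census columns (values are illustrative).
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "11th", "Bachelors"],
    "education-num": [9, 9, 13, 7, 13],
})


def is_one_to_one(df, col_a, col_b):
    """Return True if every value of col_a pairs with exactly one value of
    col_b, and every value of col_b pairs with exactly one value of col_a."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)


redundant = is_one_to_one(toy, "education", "education-num")
print(redundant)

# Counter-example: "HS-grad" now appears with two different codes.
toy_bad = toy.copy()
toy_bad.loc[0, "education-num"] = 10
not_redundant = is_one_to_one(toy_bad, "education", "education-num")
print(not_redundant)
```

When the function returns `True`, the two columns carry the same information and one of them can be dropped, which matches what the crosstab shows: a single non-zero cell per row.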


In [9]:
data_X = data_X.drop(columns="education-num")

## Separation between numerical and categorical columns

In [10]:
print(f"The dataset data_X contains {data_X.shape[0]} samples and "
      f"{data_X.shape[1]} columns")

The dataset data_X contains 48842 samples and 12 columns


In [11]:
data_X.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

### Group the column names by type

In [12]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 
                       'race', 'sex', 'native-country']

all_columns = numerical_columns + categorical_columns

data_X = data_X[all_columns]
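
The two lists above are typed by hand from the `dtypes` output. As an alternative sketch, pandas can build them from the dtypes with `select_dtypes`. The example uses a tiny made-up frame so it runs on its own; `toy`, `numerical` and `categorical` are illustrative names, and the approach assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Small stand-in frame with two numeric and two string columns.
toy = pd.DataFrame({
    "age": [25, 38],
    "workclass": ["Private", "Local-gov"],
    "capital-gain": [0, 7688],
    "sex": ["Male", "Female"],
})

# Numeric dtypes go in one list, everything else in the other.
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```

Listing the columns by hand, as the notebook does, stays the safer choice when an integer column is really a category code.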

In [13]:
print(f"The dataset data_X contains {data_X.shape[0]} samples and "
      f"{data_X.shape[1]} columns")

The dataset data_X contains 48842 samples and 12 columns


In [14]:
data_X[numerical_columns].describe()

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 38.643585 | 1079.067626 | 87.502314 | 40.422382 |
| std | 13.71051 | 7452.019058 | 403.004552 | 12.391444 |
| min | 17.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 99999.0 | 4356.0 | 99.0 |


In [15]:
data_X_numerical = data_X[numerical_columns]

## The model

### Train-test split the dataset

In [16]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_X_numerical, 
    target_y, 
    random_state=42, 
    test_size=0.25)

In [17]:
print(f"Number of samples in testing: {data_test.shape[0]} => "
      f"{data_test.shape[0] / data_X_numerical.shape[0] * 100:.1f}% of the"
      f" original set")

Number of samples in testing: 12211 => 25.0% of the original set


In [18]:
print(f"Number of samples in training: {data_train.shape[0]} => "
      f"{data_train.shape[0] / data_X_numerical.shape[0] * 100:.1f}% of the"
      f" original set")

Number of samples in training: 36631 => 75.0% of the original set
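
The split above is purely random. Because the two classes are imbalanced, one optional variation is to pass `stratify=` so that both subsets keep the same class proportions. This is not what the notebook does; it is a sketch on synthetic data, and the array names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 76 of one class and 24 of the other.
X = np.arange(200).reshape(100, 2)
y = np.array(["<=50K"] * 76 + [">50K"] * 24)

# stratify=y asks for the same class proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25, stratify=y)

train_share = (y_train == "<=50K").mean()
test_share = (y_test == "<=50K").mean()
print(f"Share of <=50K in train: {train_share:.2f}, in test: {test_share:.2f}")
```

With a dataset as large as the census file, a plain random split already gives proportions close to the original, so stratifying matters more on small datasets.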


### Display a nice model diagram

In [19]:
from sklearn import set_config
set_config(display='diagram')

### Create a logistic regression model in scikit-learn

In [20]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

### Use the fit method to train the model using the training data and labels

In [21]:
model.fit(data_train, target_train)
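
Once `fit` has run, the model can label new samples with `predict` and give class probabilities with `predict_proba`. The sketch below shows both calls on a small synthetic problem so that it runs without the census file; `X_toy`, `y_toy` and `toy_model` are hypothetical names, and the labels are generated by an arbitrary rule chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature problem: the label depends on the sign of x0 + x1.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, ">50K", "<=50K")

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# Predicted labels and class probabilities for the first five samples.
predictions = toy_model.predict(X_toy[:5])
probabilities = toy_model.predict_proba(X_toy[:5])

print(toy_model.classes_)
print(predictions)
print(probabilities.round(2))
```

The columns of `probabilities` follow the order of `toy_model.classes_`, and each row sums to 1.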

### Use the score method to check the model's statistical performance on the test set

In [22]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")

Accuracy of logistic regression: 0.807
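
An accuracy of 0.807 is easier to judge next to a baseline. Since about 76% of the samples are `<=50K` (37155 out of 48842), a model that always predicts that class would already score close to 0.76. The sketch below illustrates the idea with scikit-learn's `DummyClassifier` on synthetic data with a similar imbalance; the data and variable names are made up for the example, so the printed number is not a result on the census dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: random features, labels drawn with about 76% "<=50K".
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.where(rng.random(1000) < 0.76, "<=50K", ">50K")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
majority_share = (y_test == "<=50K").mean()
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

On the real data, replacing `X_train`, `X_test`, `y_train` and `y_test` with `data_train`, `data_test`, `target_train` and `target_test` would give the baseline to compare against the logistic regression score.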
