
# Introduction to Machine Learning

## Question

# Background
A large multinational corporation has nine broad verticals across the organization. One of the problems it faces is identifying the right people for promotion (only for the manager position and below) and preparing them in time.

# Task
Predict whether a potential promotee at a checkpoint will be promoted after the evaluation process.



# 1. Data Exploration

In [2]:
import pandas as pd
import numpy as np

# read the employee data and the glossary from their CSV files
employee_df = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')
glossary_df = pd.read_csv('https://bit.ly/2Wz3sWcGlossary')

employee_df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [3]:
#get the shape of the df
employee_df.shape

(54808, 14)

In [None]:
#get the column name & data types
employee_df.info()

## Observations
- The DataFrame has 14 columns and 54808 rows
- 5 columns are categorical
- 9 columns are numerical
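
The split into categorical and numerical columns can also be derived from the dtypes rather than typed by hand. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions, only the column names come from the dataset); it assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Toy frame reusing a few of the dataset's column names; the values are made up.
toy_df = pd.DataFrame({
    "department": ["Operations", "Technology", "Operations"],
    "gender": ["f", "m", "m"],
    "age": [35, 30, 34],
    "previous_year_rating": [5.0, 3.0, None],
    "is_promoted": [0, 0, 1],
})

# Columns with a numeric dtype (int or float)
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Every remaining column is treated as categorical in this sketch
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that flag-like columns stored as integers, such as `is_promoted`, end up in the numerical list with this approach, which matches how the notebook groups them.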

In [5]:
#separate categorical and numerical data

#assign categorical data to categorical object
categorical = ['department','region','education','gender','recruitment_channel']

#assign numerical to nums object
nums = ['employee_id','no_of_trainings','age','previous_year_rating','length_of_service','KPIs_met >80%','awards_won?',
        'avg_training_score','is_promoted']

In [None]:
employee_df[nums].describe()

In [None]:
employee_df[categorical].describe()

### Conclusion
- The distributions of no_of_trainings, age, length_of_service and avg_training_score look roughly symmetric (the mean and median are close), which is consistent with, but does not prove, a normal distribution
- The most common gender is male (m), with frequency 38496
- The most common department is Sales & Marketing, with frequency 16840
- The most common education level is Bachelor's, with frequency 36669
- The most common region is region_2, with frequency 12343
- The most common recruitment_channel is other, with frequency 30446
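
Comparing the mean and median is a quick check, and the sample skewness gives a slightly stronger signal. The sketch below uses synthetic data (the `toy` frame and its two columns are assumptions, not columns of the employee dataset) to show how the same three statistics look for a symmetric and a right-skewed distribution.

```python
import numpy as np
import pandas as pd

# Synthetic data: one symmetric column and one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=1000),
    "right_skewed": rng.exponential(scale=5, size=1000),
})

# Mean, median and sample skewness side by side for each column
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary.round(2))
```

A skewness near zero together with a small gap between mean and median suggests symmetry; a clearly positive skewness with the mean above the median points to a long right tail.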

# 2. Data Preparation

In [8]:
# standardize the column names by stripping leading and trailing spaces
employee_df.columns = employee_df.columns.str.strip()

In [9]:
# check for missing data in a dataset.
employee_df.isna().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [10]:
# replace missing previous_year_rating values with the column mean
mean_value = employee_df['previous_year_rating'].mean()
employee_df['previous_year_rating'] = employee_df['previous_year_rating'].fillna(mean_value)

# replace missing education values with the placeholder "Unknown"
employee_df['education'] = employee_df['education'].fillna("Unknown")

#check for missing records to confirm replacement
employee_df.isna().sum()


employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

In [None]:
# convert previous_year_rating from float to int
# note: the mean used to fill missing values is truncated to a whole number here
employee_df['previous_year_rating'] = employee_df['previous_year_rating'].astype(np.int64)

#confirm if conversion was successful
employee_df.info()

In [12]:
# count fully duplicated records in the dataset
employee_df.duplicated().sum()

0

## Observations
- Two columns had missing values: education (2409) and previous_year_rating (4124)
- There are no duplicate records
- The is_promoted column is the target for this dataset
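
Before modelling a binary target such as `is_promoted`, it can help to look at the share of each class, because that share sets the accuracy a trivial "always predict the majority class" model would reach. The sketch below uses a made-up ten-row series (`toy_target` is an assumption, not the real column) to illustrate the check.

```python
import pandas as pd

# Made-up target values for illustration only
toy_target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="is_promoted")

# Share of each class in the target
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the most frequent class
majority_baseline = class_share.max()
print("majority-class baseline:", majority_baseline)
```

On the real data the same two lines would be applied to `employee_df['is_promoted']`.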

# 3. Data Modelling


In [27]:

# import the necessary functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# hold out 25% of the rows as a validation set
df_train, df_valid = train_test_split(employee_df, test_size=0.25, random_state=12345)

# drop the target and the categorical columns, which are not encoded here
# note: employee_id is an identifier but is still kept as a feature
drop_cols = ['education', 'recruitment_channel', 'is_promoted', 'department', 'region', 'gender']

features = df_train.drop(drop_cols, axis=1)
target = df_train['is_promoted']

valid_features = df_valid.drop(drop_cols, axis=1)
valid_target = df_valid['is_promoted']

# no random_state is set on the tree, so the score can vary slightly between runs
model = DecisionTreeClassifier()
model.fit(features, target)
test_predictions = model.predict(valid_features)

# check the accuracy of the model on the validation set
print("accuracy score:", accuracy_score(valid_target, test_predictions))

# compare actual and predicted labels side by side
df = pd.DataFrame({'Real Values': valid_target, 'Predicted Values': test_predictions})
df

accuracy score: 0.8698000291928185


Unnamed: 0,Real Values,Predicted Values
33399,0,0
17952,0,0
38273,0,0
5855,0,0
9125,0,0
...,...,...
34638,0,0
14864,0,1
44844,1,0
7627,1,0
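
A confusion matrix summarises a real-versus-predicted table like the one above in four counts. The sketch below uses only the nine rows visible in the preview, not the full validation set, so the counts are illustrative; `real` and `predicted` are names introduced here.

```python
from sklearn.metrics import confusion_matrix

# The nine rows visible in the preview (real value, predicted value)
real = [0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(real, predicted, labels=[0, 1])
print(cm)

# Unpack into true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()
print("missed promotions (false negatives):", fn)
```

On the full validation set the same call would take `valid_target` and `test_predictions` as arguments.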


# Conclusion

The model reaches an accuracy of 86.98% on the validation set, which is a reasonable first result. It could likely be improved by tuning hyperparameters.
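
One possible way to tune the tree is a cross-validated grid search. The sketch below runs on synthetic data from `make_classification`, not on the employee dataset, and the grid values for `max_depth` and `min_samples_leaf` are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and target
X, y = make_classification(n_samples=500, n_features=8, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Candidate hyperparameter values (arbitrary choices for this sketch)
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 5, 20],
}

# 5-fold cross-validated search over every combination in the grid
search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)

# Score the refitted best model on the held-out validation split
valid_accuracy = search.score(X_valid, y_valid)
print("validation accuracy:", valid_accuracy)
```

With the notebook's data, `features`, `target`, `valid_features` and `valid_target` would take the place of the synthetic arrays.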