# Logistic Regression Assignment

- Run the cells below. If you have the data in a different location, you'll need to change the URL.
- Complete all of the numbered questions. You may call any packages that we've used in class.  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/log_reg/employee-turnover-balanced.csv')
df.head()

Unnamed: 0,left_company,age,frequency_of_travel,department,commuting_distance,education,satisfaction_with_environment,gender,seniority_level,position,satisfaction_with_job,married_or_single,last_raise_pct,last_performance_rating,total_years_working,years_at_company,years_in_current_job,years_since_last_promotion,years_with_current_supervisor
0,No,37,Travel_Rarely,Sales,16,4,4,Male,2,Sales Executive,3,Divorced,19,3,9,1,0,0,0
1,No,39,Travel_Rarely,Research & Development,3,2,3,Male,2,Laboratory Technician,3,Divorced,15,3,11,10,8,0,7
2,No,52,Travel_Frequently,Research & Development,25,4,3,Female,4,Manufacturing Director,4,Married,22,4,31,9,8,0,0
3,No,50,Non-Travel,Sales,1,3,4,Female,2,Sales Executive,3,Married,12,3,19,18,7,0,13
4,No,44,Travel_Rarely,Research & Development,4,3,4,Male,2,Healthcare Representative,2,Single,12,3,10,5,2,2,3


## Data Definitions
- `left_company`: Whether individual left the company or not. This is the target variable.  
- `age`: Age of individual. 
- `frequency_of_travel`: How often person travels for work.  
- `department`: Department where the person works (or worked).  
- `commuting_distance`: Distance person lives from office.  
- `education`: Highest education category.  
- `satisfaction_with_environment`: Satisfaction with the work environment, on a Likert scale.  
- `gender`: Gender of individual.  
- `seniority_level`: Seniority level of individual.  
- `position`: Last position held at the company.  
- `satisfaction_with_job`: Satisfaction with their job, on a Likert scale.  
- `married_or_single`: Marital status of person.  
- `last_raise_pct`: Percentage increase of the person's most recent raise.  
- `last_performance_rating`: Most recent annual performance rating, on a Likert scale.  
- `total_years_working`: Number of years the individual has spent working in their career.  
- `years_at_company`: Number of years the individual has been at the company, regardless of position.  
- `years_in_current_job`: Number of years the individual has been in their current position.  
- `years_since_last_promotion`: Years since the person had their last promotion.  
- `years_with_current_supervisor`: Years the person has had their current supervisor.

# Question 1
- What is the distribution of the target (`left_company`)?  
- Do you have any concerns on class imbalances?

In [2]:
# insert code
df['left_company'].value_counts()

No     500
Yes    500
Name: left_company, dtype: int64

The target is perfectly balanced: 500 employees left the company ("Yes") and 500 did not ("No"), so each class makes up 50% of the 1,000 rows. There are no class-imbalance concerns, which means plain accuracy is a meaningful evaluation metric and no resampling or class weighting is needed.

# Question 2
- Create and print a list of the variables that you would treat as numerical and another list for the variables that you would treat as categorical.  
- Explain your choices.

In [3]:
# Numerical variables: columns stored as integers or floats
numerical_vars = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
print("Numerical variables:", numerical_vars)
# Categorical variables: columns stored as strings (object dtype)
categorical_vars = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical variables:", categorical_vars)

Numerical variables: ['age', 'commuting_distance', 'education', 'satisfaction_with_environment', 'seniority_level', 'satisfaction_with_job', 'last_raise_pct', 'last_performance_rating', 'total_years_working', 'years_at_company', 'years_in_current_job', 'years_since_last_promotion', 'years_with_current_supervisor']
Categorical variables: ['left_company', 'frequency_of_travel', 'department', 'gender', 'position', 'married_or_single']


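The lists above are split by dtype: columns stored as integers are treated as numerical, and columns stored as strings (`object`) are treated as categorical. Numerical variables such as `age` or `years_at_company` are quantities that can be scaled and compared arithmetically, whereas categorical variables such as `department` are unordered labels that need to be one-hot encoded. Two caveats apply to this automatic split. First, `left_company` is the target, so it should not stay in the list of categorical features. Second, `education`, `seniority_level`, the two satisfaction scores, and `last_performance_rating` are ordinal codes; treating them as numerical assumes the levels are evenly spaced, and encoding them as categories is a reasonable alternative.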
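A dtype-based split can be combined with an explicit target and an explicit list of ordinal columns. The following is a minimal sketch on a made-up three-row frame; the names `toy`, `target`, `ordinal_demo`, `numerical_demo`, and `categorical_demo` are illustrative and not part of the assignment, and moving `education` to the categorical list is only one possible choice.

```python
import pandas as pd

# Tiny made-up frame that mimics a few of the assignment's columns.
toy = pd.DataFrame({
    'left_company': ['No', 'Yes', 'No'],
    'age': [37, 39, 52],
    'department': ['Sales', 'Research & Development', 'Sales'],
    'education': [4, 2, 4],
})

target = 'left_company'
features = toy.drop(columns=[target])  # the target never enters a feature list

# Start from the dtype-based split on the features only.
numerical_demo = features.select_dtypes(include='number').columns.tolist()
categorical_demo = features.select_dtypes(exclude='number').columns.tolist()

# Optionally move ordinal codes to the categorical list (a judgment call).
ordinal_demo = ['education']
numerical_demo = [c for c in numerical_demo if c not in ordinal_demo]
categorical_demo = categorical_demo + ordinal_demo

print("Numerical:", numerical_demo)
print("Categorical:", categorical_demo)
```

Listing the target separately makes it hard to pass it to the model by accident later on.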

# Question 3
- Determine if any numerical variables risk multicollinearity.  
- Remove those variables (if any) from your numerical_vars list.  
- Why did you or did not remove any?

In [4]:
corr_mat = df[numerical_vars].corr()

In [5]:
highly_corr_vars = set()
for i in range(len(numerical_vars)):
    for j in range(i+1, len(numerical_vars)):
        if abs(corr_mat.iloc[i, j]) >= 0.7:
            va1 = corr_mat.columns[i]
            va2 = corr_mat.columns[j]
            highly_corr_vars.add(va1)
            highly_corr_vars.add(va2)

print("Highly correlated numerical variables:", highly_corr_vars)

Highly correlated numerical variables: {'last_raise_pct', 'years_with_current_supervisor', 'years_in_current_job', 'years_at_company', 'total_years_working', 'seniority_level', 'last_performance_rating'}


In [6]:
for var in highly_corr_vars:
    if var in numerical_vars:
        numerical_vars.remove(var)
        

Multicollinearity arises when two or more predictors are strongly correlated with one another. It inflates the variance of the logistic regression coefficients and makes them hard to interpret, so every variable that appeared in at least one pair with an absolute correlation of 0.7 or higher was removed from `numerical_vars`. This removes seven of the thirteen numerical variables and leaves six. Note that the loop drops both members of each correlated pair, which is more aggressive than necessary: keeping one variable from each pair would remove the redundancy while retaining the information the pair carries.
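One way to keep a single variable from each correlated pair is sketched here. It is a minimal illustration on made-up data, not the assignment's code: the helper `drop_one_per_pair` and the toy columns `a`, `a_copy`, and `b` are hypothetical names, and the 0.7 threshold simply mirrors the one used above.

```python
import numpy as np
import pandas as pd

def drop_one_per_pair(frame, threshold=0.7):
    """Return column names to drop so that no remaining pair has |r| >= threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is highly correlated with any earlier column.
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

# Made-up data: 'a_copy' is a noisy duplicate of 'a', 'b' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'a_copy': a + rng.normal(scale=0.1, size=200),
    'b': rng.normal(size=200),
})

to_drop = drop_one_per_pair(toy)
kept = [c for c in toy.columns if c not in to_drop]
print("Dropped:", to_drop)
print("Kept:", kept)
```

Which member of a pair to keep is a modeling decision; this sketch keeps whichever column comes first, but domain knowledge may suggest a different choice.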

# Question 4
- Split the data into training and test sets.  
- Use 20% of the data for test and a random state of 124.  

In [7]:
# insert code here
from sklearn.model_selection import train_test_split

# Split the data into X (input features) and y (target variable)
X = df
y = df['left_company']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=124)


# Question 5
- Create a pipeline to process the numerical data.  
- Create a pipeline to process the categorical data.  

Verify each pipeline contains the columns you would expect using a fit_transform on the training data, i.e., print the shapes of the fit_transforms for each pipeline.

In [8]:
# insert code here
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Pipeline for numerical data
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

# Pipeline for categorical data
categorial_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())
])

In [9]:
# Fit and transform the training data using the numerical pipeline
X_train_num = numerical_pipeline.fit_transform(X_train[numerical_vars])
print("Numerical data shape after transformation:", X_train_num.shape)

# Fit and transform the training data using the categorical pipeline
X_train_cat = categorial_pipeline.fit_transform(X_train[categorical_vars])
print("Categorical data shape after transformation:", X_train_cat.shape)

Numerical data shape after transformation: (800, 6)
Categorical data shape after transformation: (800, 22)


# Question 6
- Create a pipeline that combines the pre-processing and implements a logistic regression model.  
- Print the accuracy on the training set and the test set.
- Do you have any concerns of overfitting based on the differences between the two accuracy scores?

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_vars),
        ('cat', categorial_pipeline, categorical_vars)
    ])

# Define the complete pipeline
pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('logistic_regression', LogisticRegression(random_state=42))
])

# Fit the pipeline to the training data
pipe.fit(X_train, y_train)

# Predict on the training and test sets
y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)

# Calculate the accuracy on the training and test sets
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)

# Print the accuracy on the training and test sets
print("Accuracy on the training set:", train_acc)
print("Accuracy on the test set:", test_acc)

Accuracy on the training set: 1.0
Accuracy on the test set: 1.0


Accuracy of 1.0 on both the training set and the test set is a warning sign, but not of overfitting. An overfit model scores noticeably higher on the training set than on the test set, and here the two scores are identical. A perfect score on unseen data almost always points to target leakage instead: `X` was set to the full DataFrame, and `categorical_vars` includes `left_company`, so the one-hot encoded target is passed to the model as a feature. The model only has to read the answer from its inputs.

These scores therefore say nothing about how well the model would predict turnover for new employees. `left_company` has to be removed from the features before the accuracy can be interpreted.
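Perfect accuracy on a held-out test set is worth checking against target leakage, since `left_company` appears in the printed categorical list above. The sketch below shows one way to keep the target out of the features. It runs on synthetic data with made-up values, and the names `demo`, `X_demo`, `y_demo`, `num_cols`, `cat_cols`, and `leak_free` are assumptions for illustration. Because the synthetic labels are random, scores near chance are expected here; the point is the structure, not the numbers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the employee data (all values are made up).
rng = np.random.default_rng(124)
n = 200
demo = pd.DataFrame({
    'age': rng.integers(20, 60, size=n),
    'commuting_distance': rng.integers(1, 30, size=n),
    'department': rng.choice(['Sales', 'Research & Development'], size=n),
    'left_company': rng.choice(['No', 'Yes'], size=n),
})

target = 'left_company'
X_demo = demo.drop(columns=[target])  # features only, target removed
y_demo = demo[target]

num_cols = ['age', 'commuting_distance']
cat_cols = ['department']             # the target is not listed here

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=124, stratify=y_demo
)

leak_free = Pipeline([
    ('preprocessing', ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ])),
    ('logistic_regression', LogisticRegression(max_iter=1000)),
])
leak_free.fit(X_tr, y_tr)

train_score = leak_free.score(X_tr, y_tr)
test_score = leak_free.score(X_te, y_te)
print("Train accuracy:", train_score)
print("Test accuracy:", test_score)
```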

# Question 7
What would you recommend as potential next steps for continuing to develop and evaluate a model?

The first step is to remove the target leakage: drop `left_company` from `X` and from `categorical_vars`, refit the pipeline, and compare the training and test accuracy again. Only then can overfitting be judged, by checking whether the training score is clearly higher than the test score.

With a leak-free model, k-fold cross-validation gives a more stable estimate of generalization than a single 80/20 split, and a confusion matrix together with precision, recall, and ROC AUC shows which kinds of errors the model makes. If the model does overfit, stronger L1 or L2 regularization (a smaller `C` in scikit-learn's `LogisticRegression`) reduces its effective complexity. It is also worth revisiting the feature choices, for example keeping one variable from each highly correlated pair instead of dropping both, and comparing against other model families such as a decision tree or a random forest. These are not simpler than logistic regression, but they can capture non-linear effects and interactions.
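Cross-validation and a small search over the regularization strength could look roughly like the sketch below. It uses a synthetic dataset from `make_classification` rather than the employee data, and the names `X_demo`, `y_demo`, `model`, and `search` as well as the grid of `C` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (a stand-in, not the employee data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=124
)

model = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),  # L2 penalty by default
])

# 5-fold cross-validated accuracy instead of a single train/test split.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print("CV accuracy: mean", cv_scores.mean(), "std", cv_scores.std())

# Smaller C means stronger regularization.
search = GridSearchCV(
    model, {'clf__C': [0.01, 0.1, 1, 10]}, cv=5, scoring='roc_auc'
)
search.fit(X_demo, y_demo)
print("Best C:", search.best_params_['clf__C'])
print("Best CV ROC AUC:", search.best_score_)
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into preprocessing.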