# <center> COVID's IMPACT on GENDER EQUALITY - Analysis 2 </center> #

## <center> Using Machine Learning to Predict Gender Equality (Job) </center> ##

### <center> Prediction Based on Aggregated Data for 141 Countries </center> ###

* Dependent Variable Name: job
* Dependent Variable Meaning: A woman can get a job in the same way as a man (1=yes; 0=no)



In [1]:
# Import dependencies
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

## <center> Split the Data into Training and Testing Sets </center> ##
### Step 1: Read the COVID_FinalProject.csv data from the CleanData folder into a Pandas DataFrame. ###

In [2]:
# Read the CSV file from the CleanData folder into a Pandas DataFrame
file_path = Path("CleanData/COVID_FinalProject.csv")
covid_df = pd.read_csv(file_path)

# Review the DataFrame
covid_df.head()

Unnamed: 0,Country_Code,Country_Name,Continent,Population,people_fully_vaccinated,median_age,total_cases,total_deaths,life_expectancy,human_development_index,...,divorce_choice,bank_account,business_ownership,contract_signature,domestic_travel,international_travel,work_at_night,dangerous_job,industrial_job,remarry
0,ABW,Aruba,North America,106459,32884.25233,41.2,21306.4472,142.32764,76.29,0.690681,...,0,1,1,0,1,1,1,1,1,1
1,AFG,Afghanistan,Asia,41128772,723061.4592,18.6,122011.0715,4977.338772,64.83,0.511,...,0,1,1,0,1,0,0,0,0,0
2,AGO,Angola,Africa,35588996,325527.2137,16.8,56850.80497,1143.634033,61.15,0.581,...,1,1,1,1,1,1,1,0,0,1
3,ALB,Albania,Europe,2842318,127356.554,38.0,175928.4351,2275.651127,78.57,0.795,...,1,1,1,1,1,1,1,1,1,1
4,AND,Andorra,Europe,79843,1542.322455,31.888298,23705.44134,113.786325,83.73,0.868,...,0,1,1,0,1,1,1,1,1,0


In [3]:
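# Drop identifier and categorical columns that will not be used as features
# Note: 'Unnamed: 0' (likely the row index saved with the CSV) is not dropped, so it stays in the feature set
covid_new_df = covid_df.drop(columns=['Country_Code', 'Country_Name', 'Continent', 'Income', 'Agency_Name', 'Agency_Acronym'])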
covid_new_df.head()

Unnamed: 0,Population,people_fully_vaccinated,median_age,total_cases,total_deaths,life_expectancy,human_development_index,aged_65_older,aged_70_older,gdp_per_capita,...,divorce_choice,bank_account,business_ownership,contract_signature,domestic_travel,international_travel,work_at_night,dangerous_job,industrial_job,remarry
0,106459,32884.25233,41.2,21306.4472,142.32764,76.29,0.690681,13.085,7.452,35973.781,...,0,1,1,0,1,1,1,1,1,1
1,41128772,723061.4592,18.6,122011.0715,4977.338772,64.83,0.511,2.581,1.337,1803.987,...,0,1,1,0,1,0,0,0,0,0
2,35588996,325527.2137,16.8,56850.80497,1143.634033,61.15,0.581,2.405,1.362,5819.495,...,1,1,1,1,1,1,1,0,0,1
3,2842318,127356.554,38.0,175928.4351,2275.651127,78.57,0.795,13.188,8.643,11803.431,...,1,1,1,1,1,1,1,1,1,1
4,79843,1542.322455,31.888298,23705.44134,113.786325,83.73,0.868,8.987312,5.70623,12983.7741,...,0,1,1,0,1,1,1,1,1,0


### Step 2: Create the labels set (y) from the “job” column, and then create the features (X) DataFrame from the remaining columns. ###

In [4]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = covid_new_df['job']

# Separate the X variable, the features
X = covid_new_df.drop(columns=['job'])

In [5]:
# Review the y variable Series
y.head()

0    1
1    1
2    1
3    1
4    0
Name: job, dtype: int64

In [6]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,Population,people_fully_vaccinated,median_age,total_cases,total_deaths,life_expectancy,human_development_index,aged_65_older,aged_70_older,gdp_per_capita,...,divorce_choice,bank_account,business_ownership,contract_signature,domestic_travel,international_travel,work_at_night,dangerous_job,industrial_job,remarry
0,106459,32884.25233,41.2,21306.4472,142.32764,76.29,0.690681,13.085,7.452,35973.781,...,0,1,1,0,1,1,1,1,1,1
1,41128772,723061.4592,18.6,122011.0715,4977.338772,64.83,0.511,2.581,1.337,1803.987,...,0,1,1,0,1,0,0,0,0,0
2,35588996,325527.2137,16.8,56850.80497,1143.634033,61.15,0.581,2.405,1.362,5819.495,...,1,1,1,1,1,1,1,0,0,1
3,2842318,127356.554,38.0,175928.4351,2275.651127,78.57,0.795,13.188,8.643,11803.431,...,1,1,1,1,1,1,1,1,1,1
4,79843,1542.322455,31.888298,23705.44134,113.786325,83.73,0.868,8.987312,5.70623,12983.7741,...,0,1,1,0,1,1,1,1,1,0


### Step 3: Check the balance of the labels variable (y) by using the value_counts function. ###

In [7]:
# Check the balance of our target values
y.value_counts()

1    127
0     13
Name: job, dtype: int64
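
To put the counts above in perspective, here is a minimal sketch that rebuilds a label Series with the same counts (127 ones, 13 zeros) and computes each class's share. The name `y_demo` is a hypothetical stand-in; in the notebook the same calls could be applied to `y` directly.

```python
import pandas as pd

# Hypothetical stand-in for y, rebuilt from the counts printed above
y_demo = pd.Series([1] * 127 + [0] * 13, name="job")

# Share of each class (normalize=True returns proportions instead of counts)
class_share = y_demo.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

With labels this imbalanced, plain accuracy can look high even when the minority class is never predicted correctly, which is why a balanced metric is useful later on.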

### Step 4: Split the data into training and testing datasets by using train_test_split. ###

In [8]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 for reproducibility and stratify on y
# so that both splits keep a similar share of each class
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
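
As a rough illustration of what `stratify=y` does in the split above, the sketch below runs the same call on stand-in data with the same class counts. `X_demo` and `y_demo` are hypothetical placeholders, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in labels with the same class counts as y (127 ones, 13 zeros)
y_demo = np.array([1] * 127 + [0] * 13)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# The default test_size is 0.25; stratify keeps class shares similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print("train size:", len(y_tr), "| class 0 count:", int((y_tr == 0).sum()))
print("test size: ", len(y_te), "| class 0 count:", int((y_te == 0).sum()))
```

Even with stratification, only a handful of class 0 rows can land in the test set, so per-class metrics for that class rest on very few examples.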

## <center> Create a Logistic Regression Model with the Original Data </center> ##
### Step 1: Fit a logistic regression model by using the training data (X_train and y_train). ###

In [9]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logit_regression = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using training data
logit_model = logit_regression.fit(X_train, y_train)
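
The features in this dataset sit on very different scales (for example, `Population` in the millions versus `human_development_index` below 1), and the `lbfgs` solver can be slow to converge on unscaled inputs. One common variation, which is not what this analysis does, is to standardize the features inside a pipeline and reweight the classes. The sketch below shows the idea on synthetic data; `make_classification` and every `*_demo` name are stand-ins chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: roughly 10% / 90% classes)
X_demo, y_demo = make_classification(
    n_samples=140, n_features=10, weights=[0.1, 0.9], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# Standardize the features, then fit a class-weighted logistic regression
scaled_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", class_weight="balanced", random_state=1),
)
scaled_logit.fit(X_tr, y_tr)

# Predictions come back as an array of 0/1 labels, one per test row
demo_predictions = scaled_logit.predict(X_te)
print(demo_predictions[:10])
```

Whether this would improve results on the real data is untested here; it is only one option to try.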

### Step 2: Save the predictions on the testing data labels by using the testing feature data (X_test) and the fitted model. ###

In [10]:
# Make a prediction using the testing data
job_predictions = logit_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following: ###
* Calculate the balanced accuracy score of the model. 

* Generate a confusion matrix. 

* Print the classification report. 

In [11]:
# Print the balanced accuracy score of the model
# (balanced_accuracy_score was imported in the first cell)

print(f"Balanced Accuracy Score: {balanced_accuracy_score(y_test, job_predictions)}")

Balanced Accuracy Score: 0.484375


In [12]:
# Generate a confusion matrix for the model
# (rows are actual classes, columns are predicted classes; confusion_matrix was imported in the first cell)
testing_model_matrix = confusion_matrix(y_test, job_predictions)
print(testing_model_matrix)

[[ 0  3]
 [ 1 31]]
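
The balanced accuracy score of 0.484375 reported earlier can be recovered by hand from this confusion matrix. The sketch below copies the matrix into an array (the name `cm` is an assumption for illustration) and computes per-class recall, balanced accuracy, and plain accuracy.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[0, 3],
               [1, 31]])

# Recall per class = correct predictions for the class / actual members of the class
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# Balanced accuracy is the mean of the per-class recalls
balanced_acc = recall_per_class.mean()

# Plain accuracy is the share of all predictions that are correct
accuracy = cm.diagonal().sum() / cm.sum()

print(f"Recall per class: {recall_per_class}")
print(f"Balanced accuracy: {balanced_acc}")
print(f"Accuracy: {accuracy:.4f}")
```

The gap between the two numbers comes from class 0: none of its 3 test rows were predicted correctly, which halves the balanced score while barely affecting plain accuracy.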


In [13]:
# Print the classification report for the model
job_report = classification_report(y_test, job_predictions)
print(job_report)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.91      0.97      0.94        32

    accuracy                           0.89        35
   macro avg       0.46      0.48      0.47        35
weighted avg       0.83      0.89      0.86        35



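## <center> Results: 48% Balanced Accuracy </center> ##

* The model predicted whether a woman can get a job in the same way as a man, worldwide following COVID, with a balanced accuracy of 48% on the 35-country test set. Overall accuracy was 89%, but that figure is inflated by class imbalance: 127 of the labels are 1 and only 13 are 0.

* The balanced accuracy is low because the model correctly identified none of the 3 test countries where job = 0 (recall 0.00), while it identified 31 of the 32 where job = 1 (recall 0.97). This suggests that more variables, and more examples of the minority class, could be taken into consideration to produce a more accurate model.

* This model aggregated 141 countries across different continents and with different economic indicators. It is therefore likely less robust than a model focused on specific countries, regions, or continents with similar socio-economic indicators.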