![iut](https://github.com/Hexanol777/STEM-Salaries-Case-Study/tree/main/Phase%201/stock_image/IUT200.png)
<hr style="margin-bottom: 40px;">


# STEM Jobs Salaries - Classification

## Classification

Classification is a fundamental machine learning task that assigns each observation to one of a set of predefined classes based on its features. It is a supervised learning technique: a model is trained on labeled data and then used to predict the classes of unseen test data.

[Link to the Data used in this Notebook](https://drive.google.com/file/d/1IhXv0qcq7YFfBxc0BQB1-z74wF40ZnZn/view?usp=share_link)


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Data Extraction - Importing Modules

The data was obtained directly from [kaggle.com](https://www.kaggle.com), a popular platform for accessing and sharing datasets. We searched for a dataset relevant to our analysis, in this case STEM job salaries, and chose one whose usability is rated 10 on the website, which gave us confidence in its quality.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore", category=pd.errors.ParserWarning)
%matplotlib inline


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Loading The Initial Data:

In [None]:
!head data/STEMJobs.csv
# Note: if you run this line locally, you may get the error below,
# as this notebook is meant to be executed in Google Colab.

In [None]:
Data = pd.read_csv(
    'data/STEM.csv',
    parse_dates=['Timestamp'])

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Data First Look:

In [None]:
Data.head()

In [None]:
Data.info()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Data Types


In [None]:
Data.dtypes

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Experienced, Non-experienced Classification



In [None]:
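# include_lowest=True keeps rows with 0 years of experience in the first bin instead of leaving them as NaN
Data['Experienced'] = pd.cut(Data['YearsOfExperience'], bins=[0, 8, Data['YearsOfExperience'].max()], labels=['Unexperienced', 'Experienced'], include_lowest=True)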
Data
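
The binning above relies on `pd.cut`, whose intervals are right-closed by default, so a value equal to the lowest bin edge falls outside every bin unless `include_lowest=True` is passed. The sketch below illustrates this on made-up values (the `years` series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Made-up years of experience, including the edge values 0 and 8.
years = pd.Series([0, 3, 8, 9, 15])
bins = [0, 8, years.max()]
labels = ["Unexperienced", "Experienced"]

# Default: the intervals are (0, 8] and (8, 15], so 0 is left unbinned (NaN).
default_cut = pd.cut(years, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 8].
inclusive_cut = pd.cut(years, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Note that a value of exactly 8 lands in the first bin in both cases, because each interval includes its right edge.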

In [None]:
exp_counts = Data.Experienced.value_counts()
print(exp_counts)

plt.figure(figsize=(3, 3))
plt.bar(exp_counts.index, exp_counts.values)
plt.title('Experienced vs. Non-Experienced')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Data Discretization


In [None]:
# q=2 splits TotalYearlyCompensation at the median into two equal-sized groups
income_bins = pd.qcut(Data['TotalYearlyCompensation'], q=2, labels=False, duplicates='drop')
bin_table = Data.groupby(income_bins)['TotalYearlyCompensation'].describe()
print(bin_table)
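
As a rough illustration of how `pd.qcut` discretizes a column into equal-frequency bins, the sketch below uses made-up compensation values (the `pay` series is hypothetical). The `q` argument sets the number of bins: 2 splits at the median, 5 gives quintiles, 10 gives deciles.

```python
import pandas as pd

# Made-up compensation values; not taken from the dataset.
pay = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 130, 140]) * 1000

# labels=False returns the integer bin index for each row.
halves = pd.qcut(pay, q=2, labels=False)
quintiles = pd.qcut(pay, q=5, labels=False)

# Rows per bin: equal-frequency binning gives (roughly) equal counts.
print(halves.value_counts().sort_index().tolist())
print(quintiles.value_counts().sort_index().tolist())

# Average pay within each quintile.
print(pay.groupby(quintiles).mean().tolist())
```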

In [None]:
labels = ['Very Low Income', 'Low Income', 'Moderate Income', 'High Income', 'Very High Income']
counts = [10203, 9988, 9993, 9921, 10025]
pay = [78290, 140733, 184321, 241417, 411220]

# bar width
bar_width = 0.35

# array indices for the x axis
x = np.arange(len(labels))

# bar chart
plt.figure(figsize=(8, 8))
plt.bar(x - bar_width/2, counts, bar_width, label='Count')
plt.bar(x + bar_width/2, pay, bar_width, label='Average Pay')

# label and titles
plt.xlabel('Income Categories')
plt.ylabel('Count / Average Pay')
plt.title('Distribution of Income')
plt.xticks(x, labels, rotation=45)
plt.legend()

# displaying the chart
plt.tight_layout()
plt.show()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Binary Encoding - Experienced


In [None]:
Data.Experienced = np.where(Data.Experienced == 'Experienced', 1, 0)
Data
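
One detail of this encoding worth keeping in mind: a comparison against a missing category evaluates to `False`, so `np.where` silently encodes missing values as 0. A minimal sketch on a toy column (the `category` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing entry, as pd.cut can produce.
category = pd.Series(pd.Categorical(
    ["Experienced", "Unexperienced", None],
    categories=["Unexperienced", "Experienced"],
))

# The comparison is False for the missing entry, so it is encoded as 0.
encoded = np.where(category == "Experienced", 1, 0)

print(encoded.tolist())
print(int(category.isna().sum()))
```

Checking `isna().sum()` before encoding is one way to confirm that no rows are being assigned to class 0 by accident.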

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Label Column - Income


In [None]:
# Split at 210,000: compensation up to that amount is LowerClass, above it is HigherClass
Data['IncomeLevel'] = pd.cut(Data['TotalYearlyCompensation'], bins=[0, 210000, Data['TotalYearlyCompensation'].max()], labels=['LowerClass', 'HigherClass'], include_lowest=True)
Data

In [None]:
income_counts = Data['IncomeLevel'].value_counts()
print(income_counts)

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Income Level Binary Encoding


In [None]:
Data.IncomeLevel = np.where(Data.IncomeLevel == 'HigherClass', 1, 0)
Data

frame = Data
frame.head()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## LogReg for Experienced



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, roc_curve, auc, confusion_matrix)

x = Data.drop(['YearsOfExperience', 'Experienced', 'Timestamp', 'Level', 'Company', 'Title', 'Country', 'IsUS', 'IsCA', 'IsID', 'IsIN'
               , 'IsDE', 'Tag', 'Gender', 'Education'], axis=1)
y = Data.Experienced



x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)



model = LogisticRegression()

scores = cross_val_score(model, x_train, y_train, cv=5)

model.fit(x_train, y_train)

y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])  # AUC from probabilities, not hard labels
f1 = f1_score(y_test, y_pred)

print("Cross-Validation Scores:", scores)
print("Accuracy:", accuracy)
print("F1 Score:", f1)
print("Precision:", precision)
print("Recall:", recall)
print("AUC-ROC:", auc_roc)

cm = confusion_matrix(y_test, y_pred)
print(cm)


In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.xticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.yticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.show()
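
The metrics printed above can all be derived from the four cells of the confusion matrix. The sketch below works through the arithmetic with hypothetical counts (not results from this notebook), using scikit-learn's layout in which rows are actual classes and columns are predicted classes:

```python
# Hypothetical confusion-matrix counts, laid out as scikit-learn does:
# rows are actual classes, columns are predicted classes.
cm = [[50, 10],   # actual 0: 50 true negatives, 10 false positives
      [5, 35]]    # actual 1: 5 false negatives, 35 true positives

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```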

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities for the positive class
y_scores = model.predict_proba(x_test)[:, 1]

# Calculate the False Positive Rate (FPR) and True Positive Rate (TPR)
fpr, tpr, _ = roc_curve(y_test, y_scores)

# Calculate the AUC score (named auc_value to avoid shadowing sklearn.metrics.auc)
auc_value = roc_auc_score(y_test, y_scores)

print(auc_value)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'AUC = {auc_value:.2f}')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
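
The ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by hand on made-up labels and probabilities (`pairwise_auc` is a hypothetical helper, not a scikit-learn function), and shows why scoring hard 0/1 predictions instead of probabilities tends to give a lower value:

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.80, 0.55, 0.90]

# Thresholding at 0.5 discards the ranking information on each side of the cut.
hard = [1 if p >= 0.5 else 0 for p in proba]

print(pairwise_auc(y_true, proba))  # from probabilities
print(pairwise_auc(y_true, hard))   # from hard labels
```

The pairwise loop is quadratic in the number of samples, so it is only meant for illustration; `roc_auc_score` is the practical choice.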

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## KNN for Experienced



In [None]:
from sklearn.neighbors import KNeighborsClassifier


x = Data.drop(['YearsOfExperience', 'Experienced', 'Timestamp', 'Level', 'Company', 'Title', 'Country', 'IsUS', 'IsCA', 'IsID', 'IsIN'
               , 'IsDE', 'Tag', 'Gender', 'Education'], axis=1)
y = Data.Experienced


# Define the range of k values
k_values = range(1, 21)

# Create a dictionary to store the accuracy for each k
accuracy_scores = {}

# Iterate over each k value
for k in k_values:
    # Create and fit the KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    
    # Predict labels for the test set
    y_pred = knn.predict(x_test)
    
    # Calculate the accuracy score
    accuracy = accuracy_score(y_test, y_pred)
    
    # Store the accuracy in the dictionary
    accuracy_scores[k] = accuracy

# Sort the accuracy scores in descending order
sorted_accuracy = sorted(accuracy_scores.items(), key=lambda x: x[1], reverse=True)

# Print the accuracy scores for each k in descending order
for k, accuracy in sorted_accuracy:
    print(f"k = {k}: Accuracy = {accuracy}")
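
The loop above ranks each `k` by its accuracy on the test set, which can make the accuracy later reported for the chosen `k` somewhat optimistic. One common alternative, sketched here under stated assumptions, is to choose `k` by cross-validation on the training set only. This is not the method used in this notebook: the data is synthetic (`make_classification`), and the `StandardScaler` step is an added assumption, included because KNN is distance-based and sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook's own features are not used here.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y)

cv_accuracy = {}
for k in range(1, 21):
    # Scaling inside the pipeline is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_accuracy[k] = cross_val_score(pipeline, X_train, y_train, cv=5).mean()

# k with the highest mean cross-validated accuracy.
best_k = max(cv_accuracy, key=cv_accuracy.get)

# The test set is touched once, after k has been chosen.
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_k, test_accuracy)
```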

In [None]:
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, knn.predict_proba(x_test)[:, 1])  # AUC from probabilities, not hard labels
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("F1 Score:", f1)
print("Precision:", precision)
print("Recall:", recall)
print("AUC-ROC:", auc_roc)

In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.xticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.yticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.show()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Random Forest for Experienced



In [None]:
from sklearn.ensemble import RandomForestClassifier

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)

# Create a Random Forest classifier with default parameters
rf_classifier = RandomForestClassifier()

# Train the classifier on the training data
rf_classifier.fit(x_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(x_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, rf_classifier.predict_proba(x_test)[:, 1])  # AUC from probabilities, not hard labels
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("F1 Score:", f1)
print("Precision:", precision)
print("Recall:", recall)
print("AUC-ROC:", auc_roc)


In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.xticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.yticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.show()
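
Because `train_test_split` is called in this notebook without a `random_state`, every run (and every re-split) draws a different random partition, so metrics vary between runs and models may be compared on different test sets. A minimal sketch of how a fixed seed makes the split repeatable, on toy arrays (the seed value 42 is an arbitrary assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features, alternating binary target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# The same seed yields an identical split on every call.
a_train, a_test, _, _ = train_test_split(X, y, train_size=0.8, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, train_size=0.8, random_state=42)

print(np.array_equal(a_test, b_test))
print(a_train.shape, a_test.shape)
```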

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## LogReg for Income Level



In [None]:
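# Features and target for the income-level task; without this, x and y still hold the Experienced data
x = frame.drop(['Bonus', 'BaseSalary', 'StockGrantValue', 'TotalYearlyCompensation', 'Experienced', 'Timestamp', 'Level', 'Company', 'Title', 'Country', 'IsUS', 'IsCA', 'IsID', 'IsIN'
               , 'IsDE', 'Tag', 'Gender', 'Education', 'IncomeLevel'], axis=1)
y = frame['IncomeLevel']

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)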



model = LogisticRegression()

scores = cross_val_score(model, x_train, y_train, cv=5)

model.fit(x_train, y_train)

y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])  # AUC from probabilities, not hard labels
f1 = f1_score(y_test, y_pred)

print("Cross-Validation Scores:", scores)
print("Accuracy:", accuracy)
print("F1 Score:", f1)
print("Precision:", precision)
print("Recall:", recall)
print("AUC-ROC:", auc_roc)

cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.xticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.yticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.show()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## KNN for Income Level



In [None]:
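x = frame.drop(['Bonus', 'BaseSalary', 'StockGrantValue', 'TotalYearlyCompensation', 'Experienced', 'Timestamp', 'Level', 'Company', 'Title', 'Country', 'IsUS', 'IsCA', 'IsID', 'IsIN'
               , 'IsDE', 'Tag', 'Gender', 'Education', 'IncomeLevel'], axis=1)
y = frame['IncomeLevel']

# Re-split so the KNN models are trained and tested on the income-level features and target
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)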


# Define the range of k values
k_values = range(1, 21)

# Create a dictionary to store the accuracy for each k
accuracy_scores = {}

# Iterate over each k value
for k in k_values:
    # Create and fit the KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    
    # Predict labels for the test set
    y_pred = knn.predict(x_test)
    
    # Calculate the accuracy score
    accuracy = accuracy_score(y_test, y_pred)
    
    # Store the accuracy in the dictionary
    accuracy_scores[k] = accuracy

# Sort the accuracy scores in descending order
sorted_accuracy = sorted(accuracy_scores.items(), key=lambda x: x[1], reverse=True)

# Print the accuracy scores for each k in descending order
for k, accuracy in sorted_accuracy:
    print(f"k = {k}: Accuracy = {accuracy}")

In [None]:
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
auc_roc = roc_auc_score(y_test, knn.predict_proba(x_test)[:, 1])  # AUC from probabilities, not hard labels
f1 = f1_score(y_test, y_pred, average='macro')

print(f"Accuracy: {accuracy}")
print("F1 Score:", f1)
print("Precision:", precision)
print("Recall:", recall)
print("AUC-ROC:", auc_roc)

In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.xticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.yticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.show()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Random Forest for Income Level



In [None]:
x = frame.drop(['Bonus', 'BaseSalary', 'StockGrantValue', 'TotalYearlyCompensation', 'Experienced', 'Timestamp', 'Level', 'Company', 'Title', 'Country', 'IsUS', 'IsCA', 'IsID', 'IsIN'
               , 'IsDE', 'Tag', 'Gender', 'Education', 'IncomeLevel'], axis=1)
y = frame['IncomeLevel']  # 1-D Series; a one-column DataFrame triggers a DataConversionWarning in fit


# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)


# Create a Random Forest classifier with default parameters
rf_classifier = RandomForestClassifier()

# Train the classifier on the training data
rf_classifier.fit(x_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(x_test)

# Calculate the metrics of the classifier
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("F1 Score:", f1)
print("Precision:", precision)
print("Recall:", recall)



In [None]:
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.xticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.yticks(ticks=[0.5, 1.5], labels=['0', '1'])
plt.show()

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
