**Salary Increase By Type of College**

Party school? Liberal Arts college? State School? You already know your starting salary will be different depending on what type of school you attend.

Growth in earning power, however, shows less disparity. Ten years out, graduates of Ivy League schools earned 99% more than they did at graduation. Party school graduates saw an 85% increase. Engineering school graduates fared worst, earning 76% more 10 years out of school. See where your school ranks.


**Salaries By Region**

Attending college in the Midwest leads to the lowest salary both at graduation and at mid-career, according to the PayScale Inc. survey. Graduates of schools in the Northeast and California fared best.


**Salary Increase By Major**

Your parents might have worried when you chose Philosophy or International Relations as a major. But a year-long survey of 1.2 million people with only a bachelor's degree by PayScale Inc. shows that graduates in these subjects earned 103.5% and 97.8% more, respectively, about 10 years post-commencement. Majors that didn't show as much salary growth include Nursing and Information Technology.

All data was obtained from the Wall Street Journal, based on data from PayScale Inc.:

Salaries for Colleges by Type

Salaries for Colleges by Region

Degrees that Pay you Back

In [3]:
## Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
## Import datasets

## df_undergrad --> by major
## df_type --> by college type
## df_region --> by college region

## All of them have 8 columns, including starting salary, mid-career salary and percentile salaries

df_undergrad = pd.read_csv("degrees-that-pay-back.csv")
df_type = pd.read_csv("salaries-by-college-type.csv")
df_region = pd.read_csv("salaries-by-region.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'degrees-that-pay-back.csv'
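The error above suggests the relative file names do not resolve from the working directory once Drive is mounted. A minimal sketch of one way to load the files, assuming they sit in a single folder: `DATA_DIR` and `load_csv` are hypothetical names, and the folder path is a placeholder that must be adjusted to wherever the CSV files were actually uploaded.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: replace with the folder that actually holds the CSV files.
DATA_DIR = Path("/content/drive/MyDrive/college-salaries")


def load_csv(name, data_dir=DATA_DIR):
    """Read one CSV from data_dir, failing early with a readable message."""
    path = Path(data_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"Expected {name} in {data_dir}")
    return pd.read_csv(path)
```

With the folder set correctly, each dataset would load as `df_undergrad = load_csv("degrees-that-pay-back.csv")`, and likewise for the other two files.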

In [None]:
## Looking at the first dataset about majors
df_undergrad.head()

In [5]:
## Renaming the columns to smaller texts
df_undergrad.columns = ["major", "start_sal", "mid_sal", "p_change", "mid_10p", "mid_25p", "mid_75p", "mid_90p"]
df_undergrad.head()

NameError: name 'df_undergrad' is not defined

In [None]:
## Checking the type of the columns
## Only the percent change column has the type float
## All other columns have the type object
df_undergrad.info()

In [None]:
## Checking what the object type means
## Using the first row of the starting salary column to check
## Type is string
## All columns with the dollar sign ($) are strings
type(df_undergrad["start_sal"][0])

In [None]:
## Converting all columns with the dollar sign from strings to numbers
## Using str.replace() and pd.to_numeric()

dollar_columns = ["start_sal", "mid_sal", "mid_10p", "mid_25p", "mid_75p", "mid_90p"]
for col in dollar_columns:
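    ## regex=False treats "$" as a literal character instead of the regex end-of-string anchor
    df_undergrad[col] = df_undergrad[col].str.replace("$", "", regex=False)
    df_undergrad[col] = df_undergrad[col].str.replace(",", "", regex=False)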
    df_undergrad[col] = pd.to_numeric(df_undergrad[col])
    df_undergrad[col] = df_undergrad[col] / 1000

df_undergrad.head()
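The same dollar-string cleanup is needed for all three datasets, so it can be wrapped in one helper. This is a sketch only: `dollars_to_thousands` is a hypothetical name, and the toy frame uses made-up values rather than rows from the real CSV files.

```python
import pandas as pd


def dollars_to_thousands(df, columns):
    """Return a copy of df with "$46,000.00"-style strings converted to thousands of dollars."""
    out = df.copy()
    for col in columns:
        cleaned = (
            out[col]
            .str.replace("$", "", regex=False)  # literal "$", not the regex end anchor
            .str.replace(",", "", regex=False)
        )
        out[col] = pd.to_numeric(cleaned) / 1000
    return out


# Toy frame standing in for the real data (values are made up)
sample = pd.DataFrame({"major": ["A", "B"], "start_sal": ["$46,000.00", "$57,700.00"]})
converted = dollars_to_thousands(sample, ["start_sal"])
```

Because the helper works on a copy, the original frame keeps its string columns until the result is assigned back.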

In [None]:
## Checking type of columns again
## All number columns are now float
df_undergrad.info()

In [None]:
## Analysing data from college majors dataset
## There are 50 undergraduate majors
## Medical school and law school are graduate majors in the US, so they aren't in the data

## Average starting salary for an undergraduate degree holder is around $41000 a year, with most people getting paid between $37000 and $50000 a year
## Average mid-career salary is $72000, with most people receiving between $60000 and $89000
## Salary varies a lot depending on the major (big standard deviation)
df_undergrad.describe()

In [None]:
## Sorting majors by starting salary (ascending, so the highest are at the end)
df_undergrad.sort_values(by = "start_sal", ascending = True, inplace = True)

## Physician Assistant has the highest starting salary
## Requires bachelor's degree in science (registered nurse, paramedics), 2-year master's degree program and license from the state

## Chemical engineering has the highest starting salary with undergraduate degree only
## Followed by various degrees of engineering, computer science and nursing
df_undergrad.tail(10)

In [None]:
## Resetting index after sorting by starting salary (ascending) to create a graph
df_undergrad = df_undergrad.reset_index()
df_undergrad.head(10)

In [None]:
## Creating first graph
x = df_undergrad.index
y = df_undergrad.start_sal
labels = df_undergrad.index

plt.scatter(x, y, color = "green", label = "Starting median salary")

## Shows all labels on the x-axis (to be changed to major names)
## Otherwise it would show spaced numbers (0, 10, 20, and so on)
plt.xticks(x, labels)

plt.xlabel("Index")
plt.ylabel("Thousand US Dollars")
plt.title("Starting Median Salary by Major")
plt.legend()
plt.show()

In [None]:
## Adding major names along the x-axis
x = df_undergrad.index
y = df_undergrad.start_sal

## Changing label from index to major column
labels = df_undergrad.major

plt.scatter(x, y, color = "green", label = "Starting median salary")

## Rotating major names to vertical position (too much overlap on the horizontal one)
plt.xticks(x, labels, rotation = "vertical")

plt.xlabel("Major")
plt.ylabel("Thousand US Dollars")
plt.title("Starting Median Salary by Major")
plt.legend()
plt.show()


In [None]:
## Flipping the y-axis and the x-axis for better visualization
x = df_undergrad.start_sal
y = df_undergrad.index
labels = df_undergrad.major

plt.scatter(x, y, color = "green", label = "Starting median salary")

## Showing all majors in the y-axis
plt.yticks(y, labels)

## Changing x-axis label to US Dollars
## Excluding y-axis label
plt.xlabel("Thousand US $")
plt.ylabel("")

plt.title("Starting Median Salary by Major")
plt.legend()
plt.show()

In [None]:
## Increasing size of image
fig = plt.figure(figsize=(8,12))

x = df_undergrad.start_sal
y = df_undergrad.index
labels = df_undergrad.major

plt.scatter(x, y, color = "green", label = "Starting median salary")
plt.yticks(y, labels)

plt.xlabel("Thousand US $")
plt.ylabel("")

plt.title("Starting Median Salary by Major")
plt.legend()
plt.show()

In [None]:
## Adding mid-career salary to the graph
fig = plt.figure(figsize=(8,12))

x = df_undergrad.start_sal
y = df_undergrad.index
labels = df_undergrad.major

plt.scatter(x, y, color = "green", label = "Starting median salary")
plt.yticks(y, labels)

x2 = df_undergrad.mid_sal
plt.scatter(x2, y, color = "blue", label = "Mid-career median salary")

plt.xlabel("Thousand US $")
plt.ylabel("")

plt.title("Starting and Mid-career Median Salary by Major")
plt.legend(loc=2)
plt.show()

In [None]:
## Sorting by mid-career median salary

## Engineering majors dominate the top
## Physician Assistant and Nursing have fallen in the ranking
## Economics, Physics and Math have risen by a good margin

df_undergrad2 = df_undergrad.sort_values(by = "mid_sal", ascending = True)
df_undergrad2 = df_undergrad2.reset_index()

fig = plt.figure(figsize=(8,12))

x = df_undergrad2.start_sal
y = df_undergrad2.index
labels = df_undergrad2.major

plt.scatter(x, y, color="gray", label = "Starting Median Salary")
plt.yticks(y, labels)

x2 = df_undergrad2.mid_sal
plt.scatter(x2, y, color = "green", label = "Mid-career Median Salary")

plt.xlabel("Thousand US $")
plt.ylabel("")
plt.title("Salary Information by Major")
plt.legend(loc = 2)
plt.show()

In [None]:
## Deleting starting salary
## Adding 25th and 75th percentile to the graph

## Economics ranks 1st in the 75th percentile

df_undergrad2 = df_undergrad.sort_values(by = "mid_sal", ascending = True)
df_undergrad2 = df_undergrad2.reset_index()

fig = plt.figure(figsize=(8,12))

x = df_undergrad2.mid_25p
y = df_undergrad2.index
labels = df_undergrad2.major

plt.scatter(x, y, color="yellow", label = "25th percentile Median Salary")
plt.yticks(y, labels)

x2 = df_undergrad2.mid_sal
plt.scatter(x2, y, color = "green", label = "Mid-career Median Salary")

x3 = df_undergrad2.mid_75p
plt.scatter(x3, y, color = "blue", label = "75th percentile Median Salary")

plt.xlabel("Thousand US $")
plt.ylabel("")
plt.title("Salary Information by Major")
plt.legend(loc = 2)
plt.show()

In [None]:
## Adding the 10th and 90th percentile to the graph

## Economics ranks 1st, followed by finance in the 90th percentile

df_undergrad2 = df_undergrad.sort_values(by = "mid_sal", ascending = True)
df_undergrad2 = df_undergrad2.reset_index()

fig = plt.figure(figsize=(8,12))

x = df_undergrad2.mid_25p
y = df_undergrad2.index
labels = df_undergrad2.major

plt.scatter(x, y, color="yellow", label = "25th percentile Median Salary")
plt.yticks(y, labels)

x2 = df_undergrad2.mid_sal
plt.scatter(x2, y, color = "green", label = "Mid-career Median Salary")

x3 = df_undergrad2.mid_75p
plt.scatter(x3, y, color = "blue", label = "75th percentile Median Salary")

x4 = df_undergrad2.mid_10p
plt.scatter(x4, y, color = "#f7e9ad", label = "10th percentile Median Salary")

x5 = df_undergrad2.mid_90p
plt.scatter(x5, y, color = "#a1b6f0", label = "90th percentile Median Salary")

plt.xlabel("Thousand US $")
plt.ylabel("")
plt.title("Salary Information by Major")

## Moving legend out of the graph
plt.legend(loc = "upper right", bbox_to_anchor=(1.46,.98))
plt.show()

In [None]:
## Adding grid to the graph

df_undergrad2 = df_undergrad.sort_values(by = "mid_sal", ascending = True)
df_undergrad2 = df_undergrad2.reset_index()

fig = plt.figure(figsize=(8,12))

## Coloring the grid lines
plt.rc('grid', alpha = .5, color = '#e3dfdf')
## Coloring the graph edges
plt.rc('axes', edgecolor = '#67746A')

x = df_undergrad2.mid_25p
y = df_undergrad2.index
labels = df_undergrad2.major

plt.scatter(x, y, color="yellow", label = "25th percentile Median Salary")
plt.yticks(y, labels)

x2 = df_undergrad2.mid_sal
plt.scatter(x2, y, color = "green", label = "Mid-career Median Salary")

x3 = df_undergrad2.mid_75p
plt.scatter(x3, y, color = "blue", label = "75th percentile Median Salary")

x4 = df_undergrad2.mid_10p
plt.scatter(x4, y, color = "#f7e9ad", label = "10th percentile Median Salary")

x5 = df_undergrad2.mid_90p
plt.scatter(x5, y, color = "#a1b6f0", label = "90th percentile Median Salary")

plt.xlabel("Thousand US $")
plt.ylabel("")
plt.title("Salary Information by Major")

plt.legend(loc = "upper right", bbox_to_anchor=(1.46,.98))
plt.grid(True)
plt.show()
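The last few cells repeat the same scatter call once per percentile column. A minimal sketch of the same chart driven by a list of series, assuming a frame with the renamed columns used above: `SERIES` and `plot_salary_spread` are hypothetical names, and the legend labels are shortened for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# (column, color, legend label) for each series, colors taken from the cells above
SERIES = [
    ("mid_10p", "#f7e9ad", "10th percentile"),
    ("mid_25p", "yellow", "25th percentile"),
    ("mid_sal", "green", "Mid-career median"),
    ("mid_75p", "blue", "75th percentile"),
    ("mid_90p", "#a1b6f0", "90th percentile"),
]


def plot_salary_spread(df, series=SERIES, sort_by="mid_sal"):
    """Scatter each salary column against the majors, ordered by sort_by."""
    ordered = df.sort_values(by=sort_by, ascending=True).reset_index(drop=True)
    fig, ax = plt.subplots(figsize=(8, 12))
    for column, color, label in series:
        ax.scatter(ordered[column], ordered.index, color=color, label=label)
    ax.set_yticks(list(ordered.index))
    ax.set_yticklabels(ordered["major"])
    ax.set_xlabel("Thousand US $")
    ax.set_title("Salary Information by Major")
    ax.grid(True)
    ax.legend(loc="upper right", bbox_to_anchor=(1.46, 0.98))
    return ax
```

Calling `plot_salary_spread(df_undergrad)` followed by `plt.show()` should draw a chart comparable to the one above; adding or removing a percentile then only means editing the list.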

In [None]:
## Looking at second dataframe (region)
## Adds school name and region
df_region.head()

In [None]:
## Changing columns names
df_region.columns = ["name", "region", "start_sal", "mid_sal", "mid_10p", "mid_25p", "mid_75p", "mid_90p"]

In [None]:
## Converting strings to float values
dollar_columns = ["start_sal", "mid_sal", "mid_10p", "mid_25p", "mid_75p", "mid_90p"]
for col in dollar_columns:
    df_region[col] = df_region[col].str.replace("$", "", regex=False)
    df_region[col] = df_region[col].str.replace(",", "", regex=False)
    df_region[col] = pd.to_numeric(df_region[col])
    df_region[col] /= 1000
df_region.head()

In [None]:
## Searching unique region values
## 5 different regions
df_region.region.unique()

In [None]:
## Sorting by starting salary and resetting index
df_region2 = df_region.sort_values(by = "start_sal", ascending=False)
df_region2 = df_region2.reset_index()
del df_region2["index"]
df_region2.head()

In [None]:
## Checking mean on starting salary
mean_start_sal_region = df_region2.start_sal.mean()
mean_start_sal_region

In [None]:
## Checking standard deviation on starting salary
std_start_sal_region = df_region2.start_sal.std()
std_start_sal_region

In [None]:
## Classifying schools by starting salary
def classify_start_sal(start_sal, mean, std):
    if start_sal > (mean + std):
        return 4
    elif start_sal > mean:
        return 3
    elif start_sal > (mean - std):
        return 2
    else:
        return 1
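The four categories are bins bounded by mean - std, mean and mean + std, so the same labelling can be done in one vectorized step. A sketch using `pd.cut`, with a hypothetical helper name `classify_by_std` and made-up salaries; it assumes the values are not all identical, since a zero standard deviation would produce duplicate bin edges.

```python
import numpy as np
import pandas as pd


def classify_by_std(values):
    """Label each value 1-4 by its position relative to mean - std, mean and mean + std."""
    mean, std = values.mean(), values.std()
    bins = [-np.inf, mean - std, mean, mean + std, np.inf]
    # right=True (the default) makes each bin (low, high], matching the strict ">" tests
    return pd.cut(values, bins=bins, labels=[1, 2, 3, 4]).astype(int)


salaries = pd.Series([40.0, 45.0, 50.0, 55.0, 75.0])  # made-up values
categories = classify_by_std(salaries)
```

The result is a Series aligned with the input, so it can be assigned directly to a new column.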

In [None]:
## Transforming starting salary into categorical data
df_region2.start_sal.map(lambda sal: classify_start_sal(sal, mean_start_sal_region, std_start_sal_region))

In [None]:
## Adding to school region dataframe
df_region2["start_sal_map"] = df_region2.start_sal.map(lambda sal: classify_start_sal(sal, mean_start_sal_region, std_start_sal_region))
df_region2.head()

In [None]:
## Checking number of schools in each category
df_region2.start_sal_map.value_counts()

In [None]:
## Plotting by Starting Salary Map
## factorplot was renamed to catplot in newer seaborn versions
sns.catplot(x='region', col='start_sal_map', kind='count', data=df_region2)

In [None]:
## Looking at third dataframe (type)
df_type.head()

In [None]:
## Renaming columns
df_type.columns = ["name", "type", "start_sal", "mid_sal", "mid_10p", "mid_25p", "mid_75p", "mid_90p"]
df_type.head()

In [None]:
## Checking unique type values
for school_type in df_type.type.unique():
    print(school_type)

In [None]:
## Counting type values
## Most are state schools
## Only a few Ivy League and Engineering ones
df_type.type.value_counts()

In [None]:
## Converting strings to float values
dollar_columns = ["start_sal", "mid_sal", "mid_10p", "mid_25p", "mid_75p", "mid_90p"]
for col in dollar_columns:
    df_type[col] = df_type[col].str.replace("$", "", regex=False)
    df_type[col] = df_type[col].str.replace(",", "", regex=False)
    df_type[col] = pd.to_numeric(df_type[col])
    df_type[col] /= 1000
df_type.head()

In [None]:
## Sorting by starting salary
## Engineering and Ivy League dominate
df_type2 = df_type.sort_values(by = "start_sal", ascending=False)
df_type2 = df_type2.reset_index()
del df_type2["index"]
df_type2.head(20)

In [None]:
## Checking mean on starting salary
mean_start_sal_type = df_type2.start_sal.mean()
mean_start_sal_type

In [None]:
## Checking standard deviation on starting salary
std_start_sal_type = df_type2.start_sal.std()
std_start_sal_type

In [None]:
## Classifying schools by starting salary
def classify_start_sal(start_sal, mean, std):
    if start_sal > (mean + std):
        return 4
    elif start_sal > mean:
        return 3
    elif start_sal > (mean - std):
        return 2
    else:
        return 1

In [None]:
## Transforming starting salary into categorical data
df_type2.start_sal.map(lambda sal: classify_start_sal(sal, mean_start_sal_type, std_start_sal_type))

In [None]:
## Adding to school type dataframe
df_type2["start_sal_map"] = df_type2.start_sal.map(lambda sal: classify_start_sal(sal, mean_start_sal_type, std_start_sal_type))
df_type2.head()

In [None]:
## Checking number of schools in each category
df_type2.start_sal_map.value_counts()

In [None]:
## Plotting by Starting Salary Map
## factorplot was renamed to catplot in newer seaborn versions
sns.catplot(x='type', col='start_sal_map', kind='count', data=df_type2)

In [None]:
## Returning to undergrad dataset
df_undergrad.info()

In [None]:
## Creating answer dataset
df_undergrad_answer = df_undergrad['mid_sal']
df_undergrad_answer.head()

In [None]:
## Preprocessing data
## Eliminating mid-salary (target column), major, percent change and index columns
df_undergrad = df_undergrad.drop(['mid_sal', 'major', 'p_change', 'index'], axis=1)
df_undergrad.head()

In [None]:
## Splitting dataset between test and train
from sklearn.model_selection import train_test_split
undergrad_train, undergrad_test, answer_train, answer_test = train_test_split(
    df_undergrad, df_undergrad_answer, test_size = 0.3, random_state = 42)

In [None]:
## Keeping independent copies of the splits so they can be restored for the second model
undergrad_train_copy = undergrad_train.copy()
undergrad_test_copy = undergrad_test.copy()
answer_train_copy = answer_train.copy()
answer_test_copy = answer_test.copy()
df_undergrad_copy = df_undergrad.copy()
df_undergrad_answer_copy = df_undergrad_answer.copy()

In [None]:
## 70% train = 35 majors
undergrad_train.info()

In [None]:
## 30% test = 15 majors
undergrad_test.info()

In [None]:
## Answer dataset became series
answer_train.head()

In [None]:
## Linear regression on train dataset
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(undergrad_train, answer_train)

In [None]:
## Predict test dataset with linear regression
answer_predict = regr.predict(undergrad_test)
answer_predict

In [None]:
## Converting to array
## Fixes bug when plotting
answer_test = answer_test.array
answer_test

In [None]:
## Converting to series and sorting values
answer_test = pd.Series(answer_test)
answer_predict = pd.Series(answer_predict)

answer_test = answer_test.sort_values(ascending=True)
answer_predict = answer_predict.sort_values(ascending=True)
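Sorting the two series independently means that, read by position, a prediction no longer sits next to the value it was made for. A minimal sketch of one alternative, which keeps each pair in the same row and sorts by the real value: `paired_sorted` is a hypothetical helper and the numbers are made up.

```python
import pandas as pd


def paired_sorted(actual, predicted):
    """Return actual and predicted values side by side, ordered by the actual value."""
    pairs = pd.DataFrame({"actual": list(actual), "predicted": list(predicted)})
    return pairs.sort_values(by="actual").reset_index(drop=True)


pairs = paired_sorted([80.0, 60.0, 70.0], [78.0, 65.0, 69.0])  # made-up numbers
```

Each row of `pairs` still holds a matching pair, so row-wise errors and plots stay consistent.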

In [None]:
## Calculating the absolute error of each prediction
mean_error = []
for item in range(len(answer_test)):
    mean_error.append(abs(answer_test[item] - answer_predict[item]))

In [None]:
## Absolute error of each prediction
mean_error

In [None]:
## Sum of the absolute errors
total_mean_error = 0.0
for error in mean_error:
    total_mean_error += error
total_mean_error

In [None]:
## Percentage error of each prediction
perc_mean_error = []
for item in range(len(mean_error)):
    perc_mean_error.append(mean_error[item] * 100/ answer_test[item])
perc_mean_error

In [None]:
## Sum of the percentage errors
total_perc_mean_error = 0.0
for error in perc_mean_error:
    total_perc_mean_error += error
total_perc_mean_error
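The loops above build per-prediction errors and their sums. Dividing those sums by the number of predictions gives the usual mean absolute error and mean absolute percentage error, which a few NumPy operations can compute directly. A sketch with a hypothetical helper `error_summary` and made-up numbers:

```python
import numpy as np


def error_summary(actual, predicted):
    """Return mean absolute error and mean absolute percentage error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_error = np.abs(actual - predicted)   # one value per prediction
    pct_error = abs_error * 100 / actual     # error relative to the real value, in %
    return {"mae": abs_error.mean(), "mape": pct_error.mean()}


summary = error_summary([100.0, 50.0], [90.0, 55.0])  # made-up numbers
```

Both inputs must be in the same row order, since the errors are computed element by element.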

In [None]:
## Linear regression coefficient and intercept
print(regr.coef_)
print()
print(regr.intercept_)

In [None]:
## Plotting test dataset with linear regression model
## Blue dots are expected values
## Red line is model
x_plot = np.arange(1.0, 16.0, 1.0)
outcome = np.dot(undergrad_test, regr.coef_) + regr.intercept_
outcome = pd.Series(outcome).sort_values(ascending = True)
plt.scatter(x_plot, answer_test)
plt.plot(x_plot, outcome, 'r--')
plt.show()

In [None]:
## Plotting with entire dataset (train + test)
x2_plot = np.arange(1.0, 51.0, 1.0)
outcome_total = np.dot(df_undergrad, regr.coef_) + regr.intercept_
outcome_total = pd.Series(outcome_total).sort_values(ascending = True)
plt.scatter(x2_plot, df_undergrad_answer.sort_values(ascending = True))
plt.plot(x2_plot, outcome_total, 'r--')
plt.show()

In [None]:
## Creating array 1s to calculate accuracy score

all_1s_test =[]
for item in range(len(perc_mean_error)):
    all_1s_test.append(1)

all_1s_test

In [None]:
## Creating categorical values using the percentage error of each prediction
## Returns 1 if error <= 2%
## Returns 0 if error > 2%

categorical_perc_mean_error_test = []
for item in range(len(perc_mean_error)):
    if perc_mean_error[item] > 2:
        categorical_perc_mean_error_test.append(0)
    else:
        categorical_perc_mean_error_test.append(1)

categorical_perc_mean_error_test

In [None]:
## Calculating accuracy score of test dataset

from sklearn.metrics import accuracy_score
acc_score_test = accuracy_score(all_1s_test, categorical_perc_mean_error_test)

acc_score_test
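Comparing the 0/1 flags against a list of all 1s gives the fraction of predictions whose error is within the 2% tolerance. A sketch of that same quantity computed directly, with a hypothetical helper `share_within_tolerance` and made-up numbers:

```python
import numpy as np


def share_within_tolerance(actual, predicted, tolerance_pct=2.0):
    """Fraction of predictions whose percentage error is at most tolerance_pct."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    pct_error = np.abs(actual - predicted) * 100 / actual
    return float(np.mean(pct_error <= tolerance_pct))


share = share_within_tolerance([100.0, 100.0, 100.0, 100.0],
                               [101.0, 103.0, 99.0, 100.0])  # made-up numbers
```

The tolerance is a parameter here, which makes it easy to see how sensitive the score is to the 2% choice.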

In [None]:
## Creating array of 0s and 1s of entire dataset (train + test)
## 50 total values
## First and last 10 values are 0
## Mid 30 values are 1

a01_total = []
for item in range(len(outcome_total)):
    if item < 10:
        a01_total.append(0)
    elif item < 40:
        a01_total.append(1)
    else:
        a01_total.append(0)

a01_total.count(0)


In [None]:
## Categorical values of entire dataset with 2% error
## Using different values for extreme cases (first and last 10 values)
## Mid 30 values are normal
## Will use to create confusion matrix later

df_undergrad_answer = df_undergrad_answer.sort_values(ascending=True)
outcome_total = outcome_total.sort_values(ascending=True)

categorical_perc_mean_error_total = []
for item in range(len(outcome_total)):

    ## Percentage error of entire dataset
    perc_error = ((abs(df_undergrad_answer[item] - outcome_total[item]) * 100) / df_undergrad_answer[item])

    ## Big error
    if (perc_error > 2):
        ## Mid values are normal
        if item >= 10 and item < 40:
            categorical_perc_mean_error_total.append(0)
        ## Extreme values are flipped
        else:
            categorical_perc_mean_error_total.append(1)

    ## Small error
    else:
        ## Mid values are normal
        if item >= 10 and item < 40:
            categorical_perc_mean_error_total.append(1)
        ## Extreme values are flipped
        else:
            categorical_perc_mean_error_total.append(0)

categorical_perc_mean_error_total = np.array(categorical_perc_mean_error_total)
categorical_perc_mean_error_total

In [None]:
## Accuracy score of entire dataset (train + test)

acc_score_total = accuracy_score(a01_total, categorical_perc_mean_error_total)

acc_score_total

In [None]:
## Creating confusion matrix
from sklearn.metrics import confusion_matrix

matriz = confusion_matrix(a01_total, categorical_perc_mean_error_total)

## Ravel flattens the 2x2 matrix row by row, giving tn, fp, fn, tp in that order
tn, fp, fn, tp = matriz.ravel()
print("tn = %i, fp = %i, fn = %i, tp = %i" % (tn, fp, fn, tp))

## Transpose of matrix to create heatmap correctly
sns.heatmap(matriz.T, annot = True)
plt.xlabel("Target")
plt.ylabel("Predicted")
plt.show()
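For a binary problem, scikit-learn's `confusion_matrix` puts the real classes in rows and the predicted classes in columns, which is why flattening yields tn, fp, fn, tp and why the heatmap above is transposed. A small sketch with made-up labels (`y_true` and `y_pred` are illustrative names only):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]  # made-up real labels
y_pred = [0, 1, 1, 1, 0]  # made-up predicted labels

matrix = confusion_matrix(y_true, y_pred)
# Row 0 holds the real 0s (tn, fp); row 1 holds the real 1s (fn, tp)
tn, fp, fn, tp = matrix.ravel()
```

Here one real 0 is predicted correctly and one is not, while two of the three real 1s are predicted correctly.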

In [None]:
## Analysing confusion matrix results

## Accuracy of extreme values (first and last 10) > accuracy of mid 30 values

print(f"Accuracy of extreme values: {tn / (tn + fp)}")
print(f"Accuracy of mid values: {tp / (fn + tp)}")

In [None]:
x2_plot = np.arange(1.0, 51.0, 1.0)
plt.scatter(x2_plot, df_undergrad_answer.sort_values(ascending = True))
plt.plot(x2_plot, outcome_total, 'r--')
plt.show()

In [None]:
undergrad_train.head()

In [None]:
answer_train.head()

In [None]:
df_undergrad_answer.head()

In [None]:
answer_predict.head()

In [None]:
# Linear regression performance
# Comparing predictions with the test answers (not the full dataset) so the rows match
mean_error = []
for item in range(len(answer_predict)):
    mean_error.append(abs(answer_test[item] - answer_predict[item]))
mean_error

In [None]:
# Total dataset
df_undergrad_answer.count()

In [None]:
# Test dataset
answer_predict.count()

In [None]:
# Multilayer perceptron used to retrain dataset
# Linear regression's performance was too poor
# Restoring the splits saved before the regression cells, in index order
undergrad_train = undergrad_train_copy.sort_index(ascending = True)
answer_train = answer_train_copy.sort_index(ascending = True)
undergrad_test = undergrad_test_copy.sort_index(ascending = True)
answer_test = answer_test_copy.sort_index(ascending = True)
df_undergrad = df_undergrad_copy.sort_index(ascending = True)
df_undergrad_answer = df_undergrad_answer.sort_index(ascending = True)

# MLP Classifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
mlp_model = make_pipeline(StandardScaler(), MLPClassifier(random_state=1, max_iter=10000))

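# Casting to int because MLPClassifier expects discrete class labels
# Float salaries are rejected as a continuous target; after the cast, each distinct integer salary is its own class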
mlp_model.fit(X=undergrad_train.astype(int), y=answer_train.astype(int))
answer_predict = mlp_model.predict(undergrad_test.astype(int))
answer_predict = list(answer_predict)
answer_predict
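The need for the integer cast can be checked with scikit-learn's `type_of_target`, which reports how a target vector is interpreted. A short sketch with made-up salaries (the variable names are illustrative):

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

salaries = np.array([77.1, 64.9, 88.3])  # made-up mid-career salaries, in thousands

float_kind = type_of_target(salaries)            # non-integer floats
int_kind = type_of_target(salaries.astype(int))  # truncated to whole numbers
```

Classifiers accept only the discrete kinds, so the float target is refused while the truncated one is treated as a set of class labels.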

In [None]:
# Creates list of answer_test
answer_test = list(answer_test)

In [None]:
answer_test

In [None]:
# Difference between predicted and real values
model_diff = []
counter = 0
while counter != len(answer_test):
    model_diff.append(answer_predict[counter] - answer_test[counter])
    counter += 1
model_diff

In [None]:
# Percentage difference
model_diff_perc = []
counter = 0
while counter != len(answer_test):
    model_diff_perc.append(model_diff[counter] * 100 / answer_test[counter])
    counter += 1
model_diff_perc

In [None]:
# Categorizing values in answer
# Bands: below 60 -> 0, 60-69 -> 1, 70-79 -> 2, 80-89 -> 3, 90-99 -> 4, 100-109 -> 5
# Values of 110 or more are also placed in band 5, so the loop always finishes
for counter, value in enumerate(answer_test):
    answer_test[counter] = min(max(int(value // 10) - 5, 0), 5)
answer_test

In [None]:
# Categorizing values in predict
# Predictions within 5% of the real value receive the real value's category
# Otherwise the same bands as the answer are used (below 60 -> 0, ..., 100 or more -> 5)
for counter, value in enumerate(answer_predict):
    if abs(model_diff_perc[counter]) < 5:
        answer_predict[counter] = answer_test[counter]
    else:
        answer_predict[counter] = min(max(int(value // 10) - 5, 0), 5)
answer_predict
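The salary bands used for both lists are fixed cut points at 60, 70, 80, 90 and 100. A sketch of the same banding done in one call with `np.digitize`, using a hypothetical helper `salary_band` and made-up salaries; note that it does not include the 5% tolerance rule applied to the predictions.

```python
import numpy as np

BAND_EDGES = [60, 70, 80, 90, 100]  # lower edges of bands 1 to 5, in thousand US $


def salary_band(values, edges=BAND_EDGES):
    """Map each salary to a band: below 60 -> 0, 60-69 -> 1, ..., 100 or more -> 5."""
    return np.digitize(np.asarray(values, dtype=float), edges)


bands = salary_band([55, 60, 69.9, 75, 89, 100, 107, 130])  # made-up salaries
```

Changing the band width or adding a band only requires editing `BAND_EDGES`.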

In [None]:
# Confusion matrix of predicted values and real values
# Heatmap
matriz_categ = confusion_matrix(answer_test, answer_predict)
sns.heatmap(matriz_categ.T, annot = True)

In [None]:
# Confusion Matrix
matriz_categ

In [None]:
# Accuracy of MLP Classifier
accuracy_score(answer_test, answer_predict)

In [None]:
# Recall Macro
# Numerator is value in diagonal
# Denominator is sum of values in row (real category)
# Recall of each row is numerator / denominator
# Recall macro is the mean of the per-row recalls
count = 0
recall_macro = []
while count != 6:
    num_diag = matriz_categ[count][count]
    sum_line = sum(matriz_categ[count])
    recall_macro.append(num_diag / (sum_line))
    count += 1
recall_macro = sum(recall_macro) / 6
recall_macro

In [6]:
# Precision Macro
# Numerator is value in diagonal
# Denominator is sum of values in column (predicted category)
# Precision of each column is numerator / denominator
# Precision macro is the mean of the per-column precisions
count = 0
precision_macro = []
while count != 6:
    num_diag = matriz_categ[count][count]
    col_count = 0
    sum_column = 0
    while col_count != 6:
        sum_column += matriz_categ[col_count][count]
        col_count += 1
    precision_macro.append(num_diag / (sum_column))
    count += 1
precision_macro = sum(precision_macro) / 6
precision_macro

NameError: name 'matriz_categ' is not defined
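The two loops above hard-code six categories and divide by row or column sums that may be zero when a category never occurs. A sketch of the same macro averages for a confusion matrix of any size, with a hypothetical helper `macro_recall_precision`; scoring an empty row or column as 0 is an assumption of this sketch, not something the notebook specifies.

```python
import numpy as np


def macro_recall_precision(matrix):
    """Macro recall and precision from a confusion matrix (rows = real, columns = predicted)."""
    matrix = np.asarray(matrix, dtype=float)
    diagonal = np.diag(matrix)
    row_sums = matrix.sum(axis=1)  # how often each category really occurs
    col_sums = matrix.sum(axis=0)  # how often each category is predicted
    # Assumption: a category with no samples (or no predictions) scores 0
    recall = np.divide(diagonal, row_sums, out=np.zeros_like(diagonal), where=row_sums > 0)
    precision = np.divide(diagonal, col_sums, out=np.zeros_like(diagonal), where=col_sums > 0)
    return recall.mean(), precision.mean()


recall_value, precision_value = macro_recall_precision([[2, 1], [0, 3]])  # made-up matrix
```

Once the confusion matrix cell has run, the helper could be called as `macro_recall_precision(matriz_categ)`.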

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp_model, X=df_undergrad.astype(int), y=df_undergrad_answer.astype(int), cv=5)

In [None]:
df_undergrad_answer.astype(int)