### For this dataset I'm going to perform a simple classification of Stack Overflow users' main branches based on their answers in the survey.
Link to dataset: https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip

In [116]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics


def remove_punctuation(text):
    # Strip every character listed in string.punctuation (string is imported above)
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

survey_df = pd.read_csv('survey_results_public.csv')
survey_df.head()

Unnamed: 0,ResponseId,MainBranch,Employment,Country,US_State,UK_Country,EdLevel,Age1stCode,LearnCode,YearsCode,...,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Slovakia,,,"Secondary school (e.g. American high school, G...",18 - 24 years,Coding Bootcamp;Other online resources (ex: vi...,,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,62268.0
1,2,I am a student who is learning to code,"Student, full-time",Netherlands,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",7.0,...,18-24 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,
2,3,"I am not primarily a developer, but I write co...","Student, full-time",Russian Federation,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",,...,18-24 years old,Man,No,Prefer not to say,Prefer not to say,None of the above,None of the above,Appropriate in length,Easy,
3,4,I am a developer by profession,Employed full-time,Austria,,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",11 - 17 years,,,...,35-44 years old,Man,No,Straight / Heterosexual,White or of European descent,I am deaf / hard of hearing,,Appropriate in length,Neither easy nor difficult,
4,5,I am a developer by profession,"Independent contractor, freelancer, or self-em...",United Kingdom of Great Britain and Northern I...,,England,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",5 - 10 years,Friend or family member,17.0,...,25-34 years old,Man,No,,White or of European descent,None of the above,,Appropriate in length,Easy,


The first five records of our dataframe give a general idea of the values and column names.

## Exercise 1 (data preparation)
a) Remove punctuation in "Employment" column using the given function.   
b) Replace all missing values (nan) in "US_State" and "UK_Country" columns with an empty "" string.  
c) Drop all rows from unemployed participants, as we are looking for correlations with being employed.   
d) Set "Accessibility" to 1 for users who answered "None of the above" and to -1 for the rest.

In [117]:
#a)
survey_df.Employment = survey_df.Employment.apply(lambda x: remove_punctuation((str(x))))

#short test: 
survey_df["Employment"][1] == 'Student, full-time'

False

The row that originally contained punctuation no longer equals the same string with punctuation, so the function works as expected.
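To make the effect of the helper concrete, here is a minimal standalone sketch that applies the same function to one of the values shown above. It also illustrates a side effect of wrapping the input in `str()`: a missing value turns into the literal text 'nan', which is why that string appears among the unique values later on. The variable names `cleaned` and `nan_as_text` are illustrative only.

```python
import string


def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


# The comma and the hyphen are both removed
cleaned = remove_punctuation('Student, full-time')
print(cleaned)

# A missing value (float NaN) becomes the text 'nan' once str() is applied
nan_as_text = remove_punctuation(str(float('nan')))
print(nan_as_text)
```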

In [118]:
#b)
survey_df["US_State"], survey_df["UK_Country"] = survey_df["US_State"].fillna(""), survey_df["UK_Country"].fillna("")

Replacing NaN values in columns "US_State" and "UK_Country" with empty strings.
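The same replacement can be written for several columns at once by selecting them with a list. This is only a sketch on a made-up two-row frame; `toy` and `region_cols` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the two region columns
toy = pd.DataFrame({
    "US_State": ["Texas", np.nan],
    "UK_Country": [np.nan, "England"],
})

# Fill missing values in both columns in a single assignment
region_cols = ["US_State", "UK_Country"]
toy[region_cols] = toy[region_cols].fillna("")

# Number of missing values left in the frame
remaining_nans = int(toy.isna().sum().sum())
print(remaining_nans)
```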

In [119]:
survey_df.Employment.unique()

array(['Independent contractor freelancer or selfemployed',
       'Student fulltime', 'Employed fulltime', 'Student parttime',
       'I prefer not to say', 'Employed parttime',
       'Not employed but looking for work', 'Retired',
       'Not employed and not looking for work', 'nan'], dtype=object)

Looking for values in the "Employment" column that indicate unemployment.

In [120]:
#c)
survey_df = survey_df.drop(survey_df[(survey_df.Employment == 'I prefer not to say') | (survey_df.Employment == 'nan') | (survey_df.Employment == 'Retired') | (survey_df.Employment == 'Not employed and not looking for work') | (survey_df.Employment == 'Not employed but looking for work')].index)


#short test:
print(sum(survey_df["Employment"] == 'I prefer not to say'),
sum(survey_df["Employment"] == 'nan'),
sum(survey_df["Employment"] == 'Retired'),
sum(survey_df.Employment == 'Not employed and not looking for work'),
sum(survey_df.Employment == 'Not employed but looking for work'))

0 0 0 0 0


Dropping the rows whose values indicate unemployment and checking that it works correctly.
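A chain of `==` comparisons like the one in c) can usually be expressed more compactly with `Series.isin`. The sketch below uses a made-up frame and the same five labels; `toy`, `not_employed` and `kept` are illustrative names.

```python
import pandas as pd

# Made-up rows with already cleaned Employment values
toy = pd.DataFrame({
    "Employment": [
        "Employed fulltime",
        "Retired",
        "nan",
        "Student parttime",
        "I prefer not to say",
    ]
})

# Labels treated as "not employed" in the notebook
not_employed = [
    "I prefer not to say",
    "nan",
    "Retired",
    "Not employed and not looking for work",
    "Not employed but looking for work",
]

# Keep only the rows whose label is NOT in the list
kept = toy[~toy["Employment"].isin(not_employed)]
print(kept["Employment"].tolist())
```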

In [121]:
survey_df.Employment.unique()

array(['Independent contractor freelancer or selfemployed',
       'Student fulltime', 'Employed fulltime', 'Student parttime',
       'Employed parttime'], dtype=object)

In [122]:
survey_df.MainBranch.unique()

array(['I am a developer by profession',
       'I am a student who is learning to code',
       'I am not primarily a developer, but I write code sometimes as part of my work',
       'I used to be a developer by profession, but no longer am',
       'I code primarily as a hobby', 'None of these'], dtype=object)

Looking for columns with values that would be suitable for the next exercises.

In [123]:
#d) 
def encode_accessibility(access):
    # 1 for "None of the above"; -1 for any other answer, including missing values
    if access == "None of the above":
        return 1
    else:
        return -1

survey_df.Accessibility = survey_df.Accessibility.apply(encode_accessibility)
#short test:
sum(survey_df["Accessibility"]**2 != 1)

0

Setting "Accessibility" to 1 for people who answered "None of the above" and to -1 for everyone else.
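The encoding in d) can also be done without a Python-level function. Below is a minimal sketch on made-up answers, assuming the same "None of the above" label; `toy` and `encoded` are illustrative names. It also shows that missing answers end up in the -1 group, because NaN never compares equal to the label.

```python
import numpy as np
import pandas as pd

# Made-up answers, including one missing value
toy = pd.Series([
    "None of the above",
    "I am deaf / hard of hearing",
    np.nan,
    "None of the above",
])

# Vectorized version: 1 where the answer matches the label, -1 otherwise
encoded = pd.Series(np.where(toy == "None of the above", 1, -1), index=toy.index)
print(encoded.tolist())
```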

In [124]:
survey_df = survey_df.dropna()

Dropping rows with NaN values. I did not fill the NaN values in the columns that may matter for the later analysis, so dropping those rows now leaves only rows that have values in all of those columns.
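Note that `dropna()` without arguments removes a row as soon as any column is missing. If only a few columns are needed later, the `subset` parameter limits the check to those columns. The sketch below compares both on invented data; the values and the choice of columns are assumptions for illustration, not the notebook's actual selection.

```python
import numpy as np
import pandas as pd

# Invented rows: CompTotal is missing far more often than the other columns
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", "Master", "Primary", None],
    "MainBranch": ["dev", "dev", "student", "dev"],
    "CompTotal": [50000.0, np.nan, np.nan, 42000.0],
})

# Drops every row with a missing value in ANY column
all_columns = toy.dropna()

# Drops only rows with a missing value in the listed columns
needed_only = toy.dropna(subset=["EdLevel", "MainBranch"])

print(len(all_columns), len(needed_only))
```

How many rows survive therefore depends strongly on which columns are checked, which in turn can change the mix of classes left in the target column.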

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform education levels into vectors using CountVectorizer. 

In [125]:
from sklearn.feature_extraction.text import CountVectorizer

In [126]:
#a)
edlvl_train, edlvl_test, branch_train, branch_test = train_test_split(survey_df.EdLevel, survey_df.MainBranch, test_size=0.3, random_state=45)


Splitting the dataset into training and test sets in a 7:3 ratio.

In [127]:
#b)
vectorizer = CountVectorizer()
vectorizer.fit(survey_df.EdLevel)

edlvl_train_V = vectorizer.transform(edlvl_train)
edlvl_test_V = vectorizer.transform(edlvl_test)

Fitting the vectorizer to the education levels and then transforming the training and test data.
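To see what the vectorizer actually produces, here is a small sketch that fits a separate `CountVectorizer` on two of the education-level strings from the survey. With the default settings the text is lowercased and only tokens of at least two word characters are kept, which is why "Ph.D." contributes only "ph". The name `demo_vectorizer` is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two education levels as they appear in the survey
levels = [
    "Other doctoral degree (Ph.D., Ed.D., etc.)",
    "Primary/elementary school",
]

demo_vectorizer = CountVectorizer()
counts = demo_vectorizer.fit_transform(levels)

# vocabulary_ maps each token to its column index; sorting the keys
# gives the tokens in column order
vocab = sorted(demo_vectorizer.vocabulary_)
print(vocab)

# One row per education level, one column per token
print(counts.toarray())
```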

## Exercise 3 
a) Train LogisticRegression model on training data (education level processed with CountVectorizer, main branches as they were).   
b) Print 10 most positive and 10 most negative words.

In [128]:
#a)
model = LogisticRegression()
model.fit(edlvl_train_V, branch_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Creating a logistic regression model and fitting it to the training data.

In [129]:
#b)
coefs = model.coef_[0]
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
d = {"Coef": coefs, "Words": vectorizer.get_feature_names_out()}
words_table = pd.DataFrame(d)

Building a dataframe with the coefficients and their words.

In [130]:
words_table.sort_values(by="Coef", ascending=False).head(10)

Unnamed: 0,Coef,Words
5,0.393567,doctoral
7,0.393567,ed
21,0.393567,ph
20,0.393567,other
16,0.231427,master
17,0.231427,mba
23,0.180021,professional
18,0.180021,md
15,0.180021,jd
11,0.157474,etc


Ten most positive words.

In [131]:
words_table.sort_values(by="Coef").head(10)

Unnamed: 0,Coef,Words
1,-0.637193,associate
8,-0.332281,elementary
22,-0.332281,primary
25,-0.118704,school
2,-0.117136,bachelor
31,0.038388,without
3,0.038388,college
29,0.038388,study
6,0.038388,earning
27,0.038388,some


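Ten lowest coefficients. Only the first five are actually negative; the remaining five are small positive values (0.038388).

As we can see, the most positive words come from higher education levels (doctoral, master's and professional degrees), while the negative words come from associate degree and primary/elementary school and, more weakly, from bachelor's degree. Note that `model.coef_[0]` holds the coefficients for a single class only, so "positive" and "negative" here mean more or less likely to belong to that class.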
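Which class the row `model.coef_[0]` describes depends on how many classes are present in the training labels. The sketch below, on made-up data, shows the two cases: with two classes scikit-learn stores a single row that refers to the second entry of `classes_`, and with more classes it stores one row per class in the order of `classes_`. The arrays and labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up feature matrix with two features
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [2, 1], [1, 2]])
y_binary = np.array(["a", "b", "a", "b", "a", "b"])
y_multi = np.array(["a", "b", "c", "a", "b", "c"])

binary_model = LogisticRegression().fit(X, y_binary)
multi_model = LogisticRegression().fit(X, y_multi)

# Binary case: one row of coefficients, describing classes_[1]
print(binary_model.classes_, binary_model.coef_.shape)

# Multiclass case: one row per class, in the order of classes_
print(multi_model.classes_, multi_model.coef_.shape)
```

Printing `model.classes_` next to the coefficient table is a cheap way to confirm which main branch the signs refer to.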

## Exercise 4 
a) Predict the main branch for the test data education levels.   
b) Predict the main branch for the test data education levels in terms of probability.   
c) Find five most positive and most negative education levels.   
d) Calculate the accuracy of predictions.

In [132]:
#a)
branch_pred = model.predict(edlvl_test_V) # predict the main branch from the vectorized education levels

Predicting the main branch from the education level using the LogisticRegression model.

In [133]:
#b)
branch_pred_proba = model.predict_proba(edlvl_test_V)

#hint: model.predict_proba()

In [134]:
#c) 
indexes_P = np.argsort(branch_pred_proba[:,1])[-5:]
print("Positive:\n", edlvl_test.iloc[indexes_P])

indexes_N = np.argsort(branch_pred_proba[:,0])[-5:]
print("\nNegative:\n", edlvl_test.iloc[indexes_N])

Positive:
 36572    Other doctoral degree (Ph.D., Ed.D., etc.)
12493    Other doctoral degree (Ph.D., Ed.D., etc.)
58874    Other doctoral degree (Ph.D., Ed.D., etc.)
2711     Other doctoral degree (Ph.D., Ed.D., etc.)
74951    Other doctoral degree (Ph.D., Ed.D., etc.)
Name: EdLevel, dtype: object

Negative:
 28387    Primary/elementary school
21402    Primary/elementary school
75647    Primary/elementary school
76123    Primary/elementary school
63260    Primary/elementary school
Name: EdLevel, dtype: object
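Because there are only a handful of distinct education levels, many test rows share exactly the same probability, so taking the last five positions of `np.argsort` tends to return the same level five times. A possible alternative, sketched here on made-up probabilities, is to rank each distinct level once. The names `levels`, `proba_class1` and `ranked` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up education levels and probabilities for one class
levels = pd.Series(["Primary", "Doctoral", "Master", "Doctoral", "Primary", "Bachelor"])
proba_class1 = np.array([0.05, 0.30, 0.20, 0.30, 0.05, 0.10])

# Same approach as in the notebook: the rows with the three highest probabilities
top3_rows = levels.iloc[np.argsort(proba_class1)[-3:]]
print(top3_rows.tolist())

# Alternative: keep each level once, then sort by probability
ranked = (
    pd.DataFrame({"EdLevel": levels, "proba": proba_class1})
    .drop_duplicates("EdLevel")
    .sort_values("proba", ascending=False)
)
top3_levels = ranked["EdLevel"].head(3).tolist()
print(top3_levels)
```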


In [135]:
#d) 
acc_logistic = sum(branch_pred==branch_test)/len(branch_test)
print("Accuracy of predictions:", acc_logistic)

Accuracy of predictions: 0.9399919452275474


As we can see above, the doctoral degree gets the highest probability in column 1 of `predict_proba` and primary/elementary school the highest in column 0, which agrees with the coefficients from Exercise 3.
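The accuracy from part d) can be cross-checked against `metrics.accuracy_score`, and it is easier to judge next to the share of the most frequent class, which is the accuracy of always predicting that class. The sketch below uses invented labels; `majority_share` is an illustrative name.

```python
import pandas as pd
from sklearn import metrics

# Invented true and predicted labels
y_true = pd.Series(["dev", "dev", "dev", "student", "dev"])
y_pred = pd.Series(["dev", "dev", "dev", "dev", "dev"])

# Manual accuracy, as computed in the notebook
manual_acc = sum(y_pred == y_true) / len(y_true)

# The same quantity from scikit-learn
sklearn_acc = metrics.accuracy_score(y_true, y_pred)

# Share of the most frequent class in the true labels
majority_share = y_true.value_counts(normalize=True).iloc[0]

print(manual_acc, sklearn_acc, majority_share)
```

If the accuracy is close to the majority share, the model may be doing little more than predicting the most common main branch.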

Because this dataset contains far fewer words than the previous one, redoing this exercise with a limited dictionary is pointless.

## Exercise 5
Random Forest Classification.

In [136]:
survey_df.columns

Index(['ResponseId', 'MainBranch', 'Employment', 'Country', 'US_State',
       'UK_Country', 'EdLevel', 'Age1stCode', 'LearnCode', 'YearsCode',
       'YearsCodePro', 'DevType', 'OrgSize', 'Currency', 'CompTotal',
       'CompFreq', 'LanguageHaveWorkedWith', 'LanguageWantToWorkWith',
       'DatabaseHaveWorkedWith', 'DatabaseWantToWorkWith',
       'PlatformHaveWorkedWith', 'PlatformWantToWorkWith',
       'WebframeHaveWorkedWith', 'WebframeWantToWorkWith',
       'MiscTechHaveWorkedWith', 'MiscTechWantToWorkWith',
       'ToolsTechHaveWorkedWith', 'ToolsTechWantToWorkWith',
       'NEWCollabToolsHaveWorkedWith', 'NEWCollabToolsWantToWorkWith', 'OpSys',
       'NEWStuck', 'NEWSOSites', 'SOVisitFreq', 'SOAccount', 'SOPartFreq',
       'SOComm', 'NEWOtherComms', 'Age', 'Gender', 'Trans', 'Sexuality',
       'Ethnicity', 'Accessibility', 'MentalHealth', 'SurveyLength',
       'SurveyEase', 'ConvertedCompYearly'],
      dtype='object')

Looking for useful columns for the classification.

In [137]:
X = pd.get_dummies(survey_df[['EdLevel', 'Employment', 'MentalHealth', 'LearnCode', 'SurveyEase', 'Age']])
y = survey_df['MainBranch']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


Choosing multiple features and one label. One-hot encoding is also needed because the random forest classifier can't work on strings.
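A small sketch of what `pd.get_dummies` does with such columns, on made-up values. A multi-select column like "LearnCode" stores several answers in one string separated by ";", so `get_dummies` treats every distinct combination as its own category. If one column per individual answer is preferred, `Series.str.get_dummies(sep=";")` is one option. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Invented rows; LearnCode holds ";"-separated multi-select answers
toy = pd.DataFrame({
    "Age": ["18-24 years old", "25-34 years old", "18-24 years old"],
    "LearnCode": ["School;Books", "Books", "School"],
})

# One column per distinct string, so "School;Books" becomes its own category
one_hot = pd.get_dummies(toy[["Age", "LearnCode"]])
print(one_hot.columns.tolist())

# One column per individual answer
multi_hot = toy["LearnCode"].str.get_dummies(sep=";")
print(multi_hot.columns.tolist())
```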

In [138]:
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Fitting data and making prediction.

In [139]:
acc_rand_forest = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", acc_rand_forest)

Accuracy: 0.9327426500201369


Calculating accuracy of our prediction.

## Exercise 6
Ada Boost Classification.

In [140]:
X = pd.get_dummies(survey_df[['EdLevel', 'Employment', 'MentalHealth', 'LearnCode', 'SurveyEase', 'Age']])
y = survey_df['MainBranch']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Choosing same features as for random forest and again doing one-hot encoding.

In [141]:
clf = AdaBoostClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Fitting data and making prediction.

In [142]:
acc_ada = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", acc_ada)

Accuracy: 0.9500604107933951


Calculating accuracy of our prediction.

In [143]:
print(f"Final comparison:\n"
f"LogisticRegression: {acc_logistic}\n"
f"RandomForestClassifier: {acc_rand_forest}\n"
f"AdaBoostClassifier: {acc_ada}\n")

Final comparison:
LogisticRegression: 0.9399919452275474
RandomForestClassifier: 0.9327426500201369
AdaBoostClassifier: 0.9500604107933951
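The three accuracies above come from different setups: the logistic regression uses only the education level and a split with `random_state=45`, while the two ensemble models use six one-hot encoded features and their own unseeded splits. For a like-for-like comparison, all models would need the same features and the same split. The sketch below shows that pattern on synthetic data from `make_classification`; the data, the seed and the `max_iter` value are assumptions for illustration and say nothing about the survey results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels
X, y = make_classification(n_samples=300, n_features=8, random_state=45)

# One fixed split shared by all models
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=45
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=45),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=45),
}

# Fit every model on the same training data and score it on the same test data
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```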

