## Objective

Our objective was to analyze survey responses on demographic and programming background to see which coding language respondents most often recommend as a first language, given those factors. Each observation includes a 'Recommended Language' response, in which the respondent recommends Python, R, C, SQL, another language, or no language at all for newcomers to learn first. Seeing which responses are most common tells us about both programmers in India and the languages themselves.

To get there, we cleaned and pre-processed the data and then built models. We used a Neural Network model, a Decision Tree model, and a Naive Bayes model to predict the Recommended First Language from our dataset. Finally, we compared the accuracy and results of the models.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
pd.set_option('display.max_rows', 100)

## Data Pre-Processing

First, we renamed the columns for readability.

In [None]:
original = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6', 
            'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10', 'Q7_Part_11', 'Q7_Part_12', 'Q7_OTHER', 'Q8', 'Q14_Part_1', 'Q14_Part_2', 
            'Q14_Part_3', 'Q14_Part_4', 'Q14_Part_5', 'Q14_Part_6', 'Q14_Part_7', 'Q14_Part_8', 'Q14_Part_9', 'Q14_Part_10', 'Q14_Part_11', 
            'Q14_OTHER', 'Q16_Part_1', 'Q16_Part_2', 'Q16_Part_3', 'Q16_Part_4', 'Q16_Part_5', 'Q16_Part_6', 'Q16_Part_7', 'Q16_Part_8', 
            'Q16_Part_9', 'Q16_Part_10', 'Q16_Part_11', 'Q16_Part_12', 'Q16_Part_13', 'Q16_Part_14', 'Q16_Part_15', 'Q16_Part_16', 
            'Q16_Part_17', 'Q16_OTHER', 'Q17_Part_1', 'Q17_Part_2', 'Q17_Part_3', 'Q17_Part_4', 'Q17_Part_5', 'Q17_Part_6', 'Q17_Part_7', 
            'Q17_Part_8', 'Q17_Part_9', 'Q17_Part_10', 'Q17_Part_11', 'Q17_OTHER', 'Q18_Part_1', 'Q18_Part_2', 'Q18_Part_3', 'Q18_Part_4', 
            'Q18_Part_5', 'Q18_Part_6', 'Q18_OTHER', 'Q19_Part_1', 'Q19_Part_2', 'Q19_Part_3', 'Q19_Part_4', 'Q19_Part_5', 'Q19_OTHER', 
            'Q20', 'Q21', 'Q22', 'Q23', 'Q24_Part_1', 'Q24_Part_2', 'Q24_Part_3', 'Q24_Part_4', 'Q24_Part_5', 'Q24_Part_6', 'Q24_Part_7', 
            'Q24_OTHER', 'Q25']

In [None]:
df = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', sep=',',usecols = original)

In [None]:
rewritten = {'Q1': 'Age', 'Q2': 'Gender', 'Q4': 'Education', 'Q5': 'Job', 'Q6': 'Experience', 'Q7_Part_1': 'Python', 
                  'Q7_Part_2': 'R', 'Q7_Part_3': 'SQL', 'Q7_Part_4': 'C', 'Q7_Part_5': 'C++', 'Q7_Part_6': 'Java', 
                  'Q7_Part_7': 'Javascript', 'Q7_Part_8': 'Julia', 'Q7_Part_9': 'Swift', 'Q7_Part_10': 'Bash', 
                  'Q7_Part_11': 'MATLAB', 'Q7_Part_12': 'No_Languages', 'Q7_OTHER': 'Other_Language', 
                  'Q8': 'Recommend_first_language', 'Q14_Part_1': 'Vis_Matplotlib', 'Q14_Part_2': 'Vis_Seaborn', 
                  'Q14_Part_3': 'Vis_Plotly', 'Q14_Part_4': 'Vis_GGPlot', 'Q14_Part_5': 'Vis_Shiney', 
                  'Q14_Part_6': 'Vis_D3JS', 'Q14_Part_7': 'Vis_Altair', 'Q14_Part_8': 'Vis_Bokeh', 
                  'Q14_Part_9': 'Vis_Geoplotlib', 'Q14_Part_10': 'Vis_Folium', 'Q14_Part_11': 'Vis_None', 
                  'Q14_OTHER': 'Vis_Other', 'Q16_Part_1': 'ML_SciKitLearn', 'Q16_Part_2': 'ML_TensorFlow', 
                  'Q16_Part_3': 'ML_Keras', 'Q16_Part_4': 'ML_Pytorch', 'Q16_Part_5': 'ML_Fast.ai', 
                  'Q16_Part_6': 'ML_MXNet', 'Q16_Part_7': 'ML_XGBoost', 'Q16_Part_8': 'ML_LightGBM', 
                  'Q16_Part_9': 'ML_CatBoost', 'Q16_Part_10': 'ML_Prophet', 'Q16_Part_11': 'ML_H2O 3', 
                  'Q16_Part_12': 'ML_Caret', 'Q16_Part_13': 'ML_TidyModels', 'Q16_Part_14': 'ML_Jax', 
                  'Q16_Part_15': 'ML_PYLightning', 'Q16_Part_16': 'ML_Huggingface', 'Q16_Part_17': 'ML_None', 
                  'Q16_OTHER': 'ML_Other', 'Q17_Part_1': 'Alg_Regress', 'Q17_Part_2': 'Alg_Trees', 
                  'Q17_Part_3': 'Alg_Gradient', 'Q17_Part_4': 'Alg_Bayesian', 'Q17_Part_5': 'Alg_Evolution', 
                  'Q17_Part_6': 'Alg_DenseNeural', 'Q17_Part_7': 'Alg_ConvNeural', 'Q17_Part_8': 'Alg_Generative', 
                  'Q17_Part_9': 'Alg_RecurNeural', 'Q17_Part_10': 'Alg_Transformer', 'Q17_Part_11': 'Alg_None', 
                  'Q17_OTHER': 'Alg_Other', 'Q18_Part_1': 'CV_General', 'Q18_Part_2': 'CV_Segment', 
                  'Q18_Part_3': 'CV_Detect', 'Q18_Part_4': 'CV_Classify', 'Q18_Part_5': 'CV_Generative', 
                  'Q18_Part_6': 'CV_None', 'Q18_OTHER': 'CV_Other', 'Q19_Part_1': 'NLP_Word', 'Q19_Part_2': 'NLP_Models',
                  'Q19_Part_3': 'NLP_context', 'Q19_Part_4': 'NLP_transformer', 'Q19_Part_5': 'NLP_None', 
                  'Q19_OTHER': 'NLP_Other', 'Q20': 'Industry', 'Q21': 'No_Employee', 'Q22': 'No.Scientists', 
                  'Q23': 'ML_in_business', 'Q24_Part_1': 'Work_Analyse', 'Q24_Part_2': 'Work_Infrastruct', 
                  'Q24_Part_3': 'Work_Proto', 'Q24_Part_4': 'Work_Service', 'Q24_Part_5': 'Work_Improve', 
                  'Q24_Part_6': 'Work_StateArt', 'Q24_Part_7': 'Work_None', 'Q24_OTHER': 'Work_Other', 'Q25': 'Salary'}

In [None]:
df.rename(rewritten, axis = 1, inplace = True)

In [None]:
df.head()

In [None]:
# Drop index 0 with the questions
df.drop(labels=0, axis=0, inplace=True)

In [None]:
# Drop rows with no answer in the Recommend_first_language column.
# Only about 1,000 of roughly 26,000 rows were null, too few to materially skew the predictions.

df = df.dropna(axis=0, subset=['Recommend_first_language'])

In [None]:
df

In [None]:
# We chose India because it has a large number of entries; list countries by response count
df["Q3"].value_counts().head(25)

In [None]:
# Keep only respondents from India; .copy() gives an independent frame to modify below
df = df[df['Q3'] == 'India'].copy()

In [None]:
df.head(25)

In [None]:
df.columns

In [None]:
import pandas as pd 
pd.options.mode.chained_assignment = None # default='warn'

In [None]:
# For each programming-language column, turn the text into a binary value: 1 for yes and 0 for no.
# A selected option holds text and an unselected one is NaN, so we test for a non-null cell
# instead of matching the text. Matching on "No_Languages" and "Other_Language" would never
# succeed, because those are our renamed column labels rather than survey answers.
lang_cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", "Julia", "Swift",
             "Bash", "MATLAB", "No_Languages", "Other_Language"]
df[lang_cols] = df[lang_cols].notna().astype(int)

In [None]:
df[["Python","R","SQL","C","C++","Java","Javascript","Julia","Swift","Bash","MATLAB","No_Languages","Other_Language"]] = df[["Python","R","SQL","C","C++","Java","Javascript","Julia","Swift","Bash","MATLAB","No_Languages","Other_Language"]].fillna(0)

In [None]:
df

In [None]:
# Keep only the demographic and language columns we use to predict y (Recommend_first_language).
# The target itself stays in X for the plots and is dropped before modeling.
X = df[['Age', 'Gender', 'Q3', 'Education', 'Job', 'Experience', 'Python', 'R',
       'SQL', 'C', 'C++', 'Java', 'Javascript', 'Julia', 'Swift', 'Bash',
       'MATLAB', 'No_Languages', 'Other_Language', 'Recommend_first_language','Salary']].copy()

In [None]:
X

In [None]:
X.isna().sum()

In [None]:
#change dtype of 1/0 to int not float
cols = ['Python', 'R','SQL', 'C', 'C++', 'Java', 'Javascript', 
   'Julia', 'Swift', 'Bash','MATLAB', 'No_Languages', 'Other_Language']


In [None]:
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
# Cast the eleven language flags to int in one step
int_cols = ['Python', 'R', 'SQL', 'C', 'C++', 'Java', 'Javascript',
            'Julia', 'Swift', 'Bash', 'MATLAB']
X[int_cols] = X[int_cols].astype(int)

In [None]:
X

#### For Salary, we replaced each range with its midpoint. We split the range string into its two bounds, stripped the currency symbols and separators with regex replacements, and averaged the two numbers. We then filled rows with null values with the mean of the column.

In [None]:
salary = X['Salary']

In [None]:
salary = (salary
 .astype(str).str.split('-', expand=True)
)

In [None]:
# Strip dollar signs, thousands separators and inequality signs in one pass.
# A raw string avoids the invalid-escape warning that '\$' triggers.
salary = salary.replace({r'[$,<>]': ''}, regex=True)

In [None]:
salary

In [None]:
salary = salary.astype(float).mean(axis=1)

In [None]:
X["Salary"] = salary

In [None]:
# Pass the column by keyword; the positional axis argument was removed in pandas 2.0
X = X.drop(columns='Q3')

In [None]:
X.head(10)

In [None]:
# Assign the result back instead of calling fillna in place on a column selection
X['Salary'] = X['Salary'].fillna(int(X['Salary'].mean()))
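The salary steps above can be traced on a small example. This is a minimal sketch on hypothetical salary strings written in the survey's range style, not the real column. It uses `pd.to_numeric` with `errors="coerce"` for the numeric conversion, which is a swapped-in choice rather than the call used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical salary answers in the survey's range style, plus one missing answer.
raw = pd.Series(["$0-999", "1,000-1,999", ">$1,000,000", np.nan])

# Split each range on the hyphen into a lower-bound and an upper-bound column.
bounds = raw.astype(str).str.split("-", expand=True)

# Strip dollar signs, thousands separators and inequality signs.
bounds = bounds.replace({r"[$,<>]": ""}, regex=True)

# Convert to numbers; anything that is not a number becomes NaN.
bounds = bounds.apply(pd.to_numeric, errors="coerce")

# Average the two bounds row by row; a missing bound is skipped.
midpoint = bounds.mean(axis=1)

# Fill the missing answer with the mean of the observed midpoints.
filled = midpoint.fillna(midpoint.mean())
print(filled.tolist())
```

An open-ended answer such as `>$1,000,000` has no upper bound, so its "midpoint" is just the lower bound, which understates the top bracket.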

In [None]:
y = X['Recommend_first_language']

In [None]:
#X = X.drop('Salary', 1)

In [None]:
#Use the null values in Recommended first language as my test set

In [None]:
X.head()

In [None]:
X['Age'].value_counts()

In [None]:
X['Gender'] = X['Gender'].replace({'Nonbinary':'Other','Prefer to self-describe':'Other'})

In [None]:
X['Education'] = X['Education'].replace({'Some college/university study without earning a bachelor’s degree':'Some College','I prefer not to answer':'No Response', 'No formal education past high school':'Highschool'})

In [None]:
X['Age'] = X['Age'].replace({'18-21':'18-24','22-24':'18-24','25-29':'25-34','30-34':'25-34','35-39':'35-44','40-44':'35-44','45-49':'45-54','50-54':'45-54','55-59':'55+','60-69':'55+','70+':'55+'})

In [None]:
X

## Data Visualization

### In this section, we take a look at our now processed data, starting with the correlation between our feature variables

In [None]:
# Correlate the numeric columns only; text columns cannot be correlated
corrmat = X.select_dtypes(include='number').corr()
plt.figure(figsize=(20,20))
# plot heat map
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")

#### We see that some of the features are negatively correlated with each other (e.g. Gender and Experience, MATLAB and Salary), meaning that as one increases the other tends to decrease. Gender and Experience enter the heatmap only in their numeric encoding (see Pre-Processing Part Two).

#### Looking at the survey respondents in India, most of them held either a Bachelor's or a Master's degree:

In [None]:
# Distribution of degrees among respondents in India. Based on the plot, most respondents had a bachelor's degree,
# and the second largest group had a master's degree.

sns.set_theme(style="whitegrid", font_scale = 0.9)
ax = sns.countplot(x="Education" , data=X)
ax.tick_params(axis='x', rotation=90)

#### Our Age distribution shows that more than half of our data comes from the 18-24 and 25-34 age groups

In [None]:
# Distribution of ages among respondents in India. Based on the plot, most respondents were between the ages of 18 and 24.

sns.set_theme(style="whitegrid", font_scale = 0.9)
ax = sns.countplot(x="Age" , data=X)
ax.tick_params(axis='x', rotation=90)

In [None]:
age = X.groupby("Age")["Recommend_first_language"].size()
age.plot.pie(autopct="%.1f%%");

#### We compared Gender to Recommended First Language.
- 0 = Man
- 1 = Woman
- 2 = Prefer not to say
- 3 = Other

#### For men, the distribution of language recommendations was fairly even across the board. Women and respondents in the Other group, however, tended to recommend fewer distinct languages.

In [None]:
sns.catplot(x="Gender", y="Recommend_first_language", data=X)

#### This line plot shows the relationship between Experience, Salary, and Recommended First Language
- 0 = I have never written code, or < 1 year
- 1 = 1-3 years and 3-5 years (entry level)
- 2 = 5-10 years and 10-20 years (mid level to senior)
- 3 = 20+ years (senior to executive)

#### Each language is represented by a different color and line pattern. For example, many entry-level respondents seem to recommend Java, whereas people with much more experience recommended more complex languages like Julia and C.

In [None]:
sns.set(rc = {'figure.figsize':(15,8)})
sns.lineplot(data=X, x="Experience", y="Salary", hue="Recommend_first_language", style="Recommend_first_language")

In [None]:
sns.set(rc = {'figure.figsize':(15,8)})
sns.lineplot(data=X, x="Education", y="Age", hue="Recommend_first_language", style="Recommend_first_language")

## Predictive Model Creation
We will implement three models to predict the Recommended First Language and compare their accuracy.

### Pre-Processing Part Two:

We use get_dummies to one-hot encode the remaining categorical columns of X, so that every feature is numeric. This increases the number of attributes significantly, but it is necessary for our models to work (we accept a possibly lower accuracy score as the tradeoff).
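To make the effect of `get_dummies` concrete, here is a minimal sketch on a hypothetical three-row frame (the values are made up for illustration). Each text column expands into one indicator column per category, while the numeric column passes through unchanged.

```python
import pandas as pd

# Hypothetical mini-frame mixing one numeric column with two categorical ones.
toy = pd.DataFrame({
    "Salary": [500.0, 1500.0, 500.0],
    "Age": ["18-24", "25-34", "18-24"],
    "Job": ["Student", "Data Scientist", "Student"],
})

# Text columns become one indicator column per distinct value;
# numeric columns are left as they are.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Three input columns become five here; on the real data the growth is larger because columns such as Job and Education have many categories.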

In [None]:
X['Gender'] = X['Gender'].replace({'Other':3, 'Man':0, 'Woman':1,'Prefer not to say':2 })
X.head()

In [None]:
X['Experience'] = X['Experience'].replace({'I have never written code':0,'< 1 years':0,'1-3 years':1,'3-5 years':1,'5-10 years':2,'10-20 years':2,'20+ years':3})

In [None]:
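# Remove the target from the feature matrix (keyword form; positional axis was removed in pandas 2.0)
X = X.drop(columns='Recommend_first_language')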

In [None]:
X = pd.get_dummies(X)

In [None]:
X.columns

In [None]:
X

Splitting the data into our train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
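One variation that may be worth trying, since the recommended language is heavily skewed toward a few answers, is a stratified split. The sketch below runs on synthetic data with made-up class counts and is not what the notebook does. It assumes every class has at least two rows, which `stratify` requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 rows of binary features and an imbalanced target.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(300, 5))
y_toy = np.array(["Python"] * 240 + ["R"] * 30 + ["SQL"] * 30)

# stratify=y_toy keeps the class proportions roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

# Every class shows up in the test set, including the rare ones.
print(sorted(set(y_te)))
```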

### Decision Tree Classifier and Feature Importance

In [None]:
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train,y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

In [None]:
importance = clf.feature_importances_

In [None]:
# summarize feature importance, labelling each score with its column name
for name, score in zip(X.columns, importance):
    print('Feature: %s, Score: %.5f' % (name, score))
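With this many one-hot columns, a ranked view is often easier to read than the full printout. The following is a minimal sketch on synthetic data from `make_classification`; the names `X_demo`, `tree_demo` and the `feat_*` columns are hypothetical stand-ins for the notebook's `X` and `clf`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features.
X_arr, y_arr = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

# Fit a tree, then pair each importance score with its column name.
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_arr)
ranked = pd.Series(
    tree_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

# Show the three most important features.
print(ranked.head(3))
```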

In [None]:
y_pred

In [None]:
from sklearn import tree

In [None]:
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render('final')

#### Accuracy Check on our Decision Tree Model

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [None]:
acc_score = accuracy_score(y_test, y_pred)

In [None]:
acc_score

In [None]:
class_repo = classification_report(y_test, y_pred, output_dict=True)

In [None]:
repoDf = pd.DataFrame(class_repo).transpose()

In [None]:
repoDf

### MLP Classifier

In [None]:
#import MLP Classifier
from sklearn.neural_network import MLPClassifier
from sklearn.exceptions import ConvergenceWarning

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(500,500,500,500,500), 
                    activation='logistic',solver='sgd', learning_rate_init=0.3, random_state=1, shuffle=False)

In [None]:
import time 
startTime = time.time()
# Fit on the training split only, so the test set stays unseen
results = mlp.fit(X_train, y_train)
endTime = time.time()
print('model was trained in', round(endTime-startTime,2), 'seconds.')

In [None]:
coefs = results.coefs_[0]

#### Accuracy of our MLP Classifier

In [None]:
# Score the MLP itself (clf is the decision tree from the previous section)
mlp.score(X_test, y_test)

### Naive Bayes

In [None]:
from sklearn.naive_bayes import BernoulliNB,MultinomialNB,GaussianNB
#modelNB = GaussianNB()
modelNB = BernoulliNB()
#modelNB = MultinomialNB()

modelNB.fit(X_train,y_train)

predicted_test = modelNB.predict(X_test)

In [None]:
predicted_test

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score

cm = confusion_matrix (y_test, predicted_test)
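A raw confusion matrix is hard to read without class names. As a minimal sketch, using made-up labels rather than the notebook's predictions, the matrix can be wrapped in a labelled DataFrame:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six respondents.
y_true = ["Python", "R", "Python", "SQL", "Python", "R"]
y_hat = ["Python", "Python", "Python", "SQL", "R", "R"]
labels = ["Python", "R", "SQL"]

# Rows are the true classes, columns are the predicted classes.
cm_demo = confusion_matrix(y_true, y_hat, labels=labels)
cm_df = pd.DataFrame(
    cm_demo,
    index=[f"true_{name}" for name in labels],
    columns=[f"pred_{name}" for name in labels],
)
print(cm_df)
```

Passing `labels` explicitly fixes the row and column order, so the names always line up with the counts.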

#### Accuracy of our Naive Bayes Model

In [None]:
print(accuracy_score(y_test,predicted_test))
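For the comparison to be fair, all three models should be trained on the same training split and scored on the same test split. The sketch below shows one way to do that in a single loop. It runs on synthetic data, and the small MLP configuration is an assumption chosen to keep the example fast, not the architecture used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded survey data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "Naive Bayes": BernoulliNB(),
}

# Fit each model on the training split and score it on the held-out split.
scores = {
    name: model.fit(X_tr, y_tr).score(X_te, y_te)
    for name, model in models.items()
}

# Print the models from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```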

### CONCLUSION

Across our models, all of them maintained an accuracy greater than 75%, with our Naive Bayes model holding the highest accuracy at 82%.

This is original work done by Briauna Brown and Matthew Geis.