# How to break into the field
In this notebook I follow along with the notebook from the Udacity course and practice some basic data wrangling. <br>
**Question: What do the survey takers advise for breaking into the field of software development?**

In [1]:
import numpy as np
import pandas as pd
from collections import defaultdict
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('survey-results-public.csv')
df.head(2)

Unnamed: 0,Respondent,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,...,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
0,1,Student,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school,,,,...,Strongly disagree,Male,High school,White or of European descent,Strongly disagree,Strongly agree,Disagree,Strongly agree,,
1,2,Student,"Yes, both",United Kingdom,"Yes, full-time",Employed part-time,Some college/university study without earning ...,Computer science or software engineering,"More than half, but not all, the time",20 to 99 employees,...,Strongly disagree,Male,A master's degree,White or of European descent,Somewhat agree,Somewhat agree,Disagree,Strongly agree,,37500.0


To answer the question, we need to look up the CousinEducation column in the schema file, survey-results-schema.csv.

In [11]:
dfs = pd.read_csv('survey-results-schema.csv')
list(dfs[dfs.Column == 'CousinEducation']['Question'])

["Let's pretend you have a distant cousin. They are 24 years old, have a college degree in a field not related to computer programming, and have been working a non-coding job for the last two years. They want your advice on how to switch to a career as a software developer. Which of the following options would you most strongly recommend to your cousin?\nLet's pretend you have a distant cousin named Robert. He is 24 years old, has a college degree in a field not related to computer programming, and has been working a non-coding job for the last two years. He wants your advice on how to switch to a career as a software developer. Which of the following options would you most strongly recommend to Robert?\nLet's pretend you have a distant cousin named Alice. She is 24 years old, has a college degree in a field not related to computer programming, and has been working a non-coding job for the last two years. She wants your advice on how to switch to a career as a software developer. Which

In [21]:
# Just checking what portion of this column is NaN.
df.CousinEducation.isnull().mean()

0.5414072229140723

In [32]:
# Now let's check the distribution of different answers.
# Note that reset_index() converts the pandas Series into a DataFrame
study = df['CousinEducation'].value_counts().reset_index()
study.head()

Unnamed: 0,index,CousinEducation
0,Take online courses; Buy books and work throug...,711
1,Take online courses,551
2,None of these,523
3,Take online courses; Part-time/evening courses...,479
4,Take online courses; Bootcamp; Part-time/eveni...,465


Looks like this is a multiple-choice question whose answers are combinations of options separated by `; `. <br>
Let's try to clean this up.


In [33]:
# First let's change the column names
study.rename(columns = {'index': 'Method', 'CousinEducation': 'Count'}, inplace = True)
study.head()

Unnamed: 0,Method,Count
0,Take online courses; Buy books and work throug...,711
1,Take online courses,551
2,None of these,523
3,Take online courses; Part-time/evening courses...,479
4,Take online courses; Bootcamp; Part-time/eveni...,465
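
The rename above assumes that `reset_index()` names the two columns `index` and `CousinEducation`, which is what the output here shows. Newer pandas releases (2.0 and later, as far as I know) name them `CousinEducation` and `count` instead, in which case the rename would quietly do nothing. A minimal sketch of a naming step that should not depend on the version, shown on made-up answers rather than the survey data (`answers` and `toy_study` are hypothetical names):

```python
import pandas as pd

# Made-up answers standing in for df['CousinEducation'] (toy data, not the survey).
answers = pd.Series(
    ['Take online courses', 'Bootcamp', 'Take online courses', None],
    name='CousinEducation',
)

# rename_axis names the index (the answers);
# reset_index(name=...) names the column that holds the counts.
toy_study = answers.value_counts().rename_axis('Method').reset_index(name='Count')
print(toy_study)
```

Because both column names are set explicitly, no separate `rename` call is needed afterwards.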


### Write a function to clean up the data in the `study` dataframe
Here I try to write a function that is more efficient than the example provided in the course material. Instead of searching every row for each possible method, I loop through the rows once, split each row's 'Method' string into a list of individual methods, and add the row's count to each of those methods in a dict.
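
To make the idea concrete before the real function, here is a minimal sketch of the same split-and-accumulate loop on a few made-up rows (the `rows` list and its counts are invented for illustration):

```python
from collections import defaultdict

# Made-up (method string, count) rows in the same shape as the study dataframe.
rows = [
    ('Take online courses; Bootcamp', 3),
    ('Take online courses', 2),
    ('Bootcamp; Contribute to open source', 1),
]

totals = defaultdict(int)
for methods, n in rows:
    # One pass per row: split the combined answer and credit each method.
    for method in methods.split('; '):
        totals[method.strip()] += n

print(dict(totals))
```

Each row is visited once, so the work grows with the number of rows and the options per row, not with the number of distinct methods.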

In [48]:
def get_count(df, cola, colb):
    """
    INPUT:
    df - The dataframe you want to search counts from
    cola - (str) The name of the col where the keys (categories) are
    colb - (str) The name of the col where the count numbers are

    OUTPUT:
    count - A dataframe showing the total count of each key (category)
    
    """
    count = defaultdict(int)
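    # Iterate over the values directly so the loop does not depend on the index labels
    for s, n in zip(df[cola], df[colb]):
        # Split the combined answer into its individual options
        keys = s.split('; ')
        for key in keys:
            key = key.strip()
            # Credit this row's count to every option it contains
            count[key] += n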
    count = pd.DataFrame(pd.Series(count)).reset_index()
    count.columns = [cola, colb]
    count.sort_values(colb, ascending = False, inplace = True)
    
    return count 

In [50]:
study = get_count(study, 'Method', 'Count')
study.head()

Unnamed: 0,Method,Count
0,Take online courses,15246
1,Buy books and work through the exercises,11750
3,Part-time/evening courses,7517
7,Contribute to open source,7423
4,Bootcamp,5276
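
For comparison, the same totals can probably be obtained without an explicit Python loop by splitting the strings and using `DataFrame.explode` (available in pandas 0.25 and later). This is an alternative sketch on made-up data, not the course's approach and not the function used above; `toy`, `exploded` and `totals` are hypothetical names:

```python
import pandas as pd

# Made-up rows in the same shape as the study dataframe before cleaning.
toy = pd.DataFrame({
    'Method': [
        'Take online courses; Bootcamp',
        'Take online courses',
        'Bootcamp; Contribute to open source',
    ],
    'Count': [3, 2, 1],
})

# Turn each combined answer into a list, then give every option its own row.
exploded = toy.assign(Method=toy['Method'].str.split('; ')).explode('Method')
exploded['Method'] = exploded['Method'].str.strip()

# Sum the counts per option and sort from most to least recommended.
totals = (
    exploded.groupby('Method', as_index=False)['Count'].sum()
    .sort_values('Count', ascending=False)
    .reset_index(drop=True)
)
print(totals)
```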
