# StarterHacks 2020 Intro to Data Analytics Workshop 

This workshop is meant to get students interested in data analytics.

To accomplish this goal:

- We use a dataset that participants are likely to be interested in working with (the UW student resume dataset)
- We avoid data cleaning, since its impact is hard to communicate. Instead, we teach "data validation" and try to motivate why it's important
- Rather than focusing on breadth or depth, this workshop keeps the content minimal and provides many small exercises that people code up on their own
- While package management is an important part of data-science jobs, this workshop abstracts it away, because installation issues may cause resentment towards the profession. Everything is hosted on JupyterHub (AWS) to minimize user-side installation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression 

In [None]:
# Automated  Basic Cleaning
# All of this should be baked into the dataset before distributing it to students  

df = pd.read_csv("Resume.csv").applymap(lambda s:s.lower().strip() if type(s) == str else s)
df = df[df["startyear"] != 2011]
df["skills"] = df["skills"].str.replace(" ", "")

def isUs(country):
    return type(country) is str and "canada" not in country and (", ca" in country or ", wa" in country or ", ny" in country or ", nj" in country)

# Tag whether each internship is in the USA
for i in range(1, 7):
    df[f"is_us_{i}"] = df[f"location{i}"].map(isUs)

# Remove newlines in role descriptions
for i in range(1, 7):
    df[f"roles{i}"] = df[f"roles{i}"].str.replace("\n", " ").str.replace("  ", " ")

# Construct # of internship Field
df["numcoops"] = 1
df.loc[~df["company2"].isnull(), "numcoops"] = 2
df.loc[~df["company3"].isnull(), "numcoops"] = 3
df.loc[~df["company4"].isnull(), "numcoops"] = 4
df.loc[~df["company5"].isnull(), "numcoops"] = 5
df.loc[~df["company6"].isnull(), "numcoops"] = 6
df["numcoops_noisy"] = df["numcoops"] + np.random.normal(0, 0.1, len(df))
df["startyear_noisy"] = df["startyear"] + np.random.normal(0, 0.1, len(df))

df.loc[df["avggpa"] < 5, "gpa"] = df["avggpa"]
df.loc[df["avggpa"] > 5, "pctgrades"] = df["avggpa"]
del df["avggpa"]
df.to_csv("resume2.csv") 

In [None]:
_df = pd.DataFrame(data =  {'col1': [None, "I", "like", "cake", "and", "pie"], 
                           'col2': [1, 1, None, 1, 1, None],
                           'col3': [2, 1, 2, 1, 2, 1], 
                           'col4': [1, None, 3, 4, 5, 6],
                           'col5': [1, 2, 4, None, 16, 32]})
_df.to_csv("testdata1.csv",index=False)

### Demo 1: Intro to Jupyter Notebook Environment & DataFrame Operations

Students should learn to use:
- Column selection, e.g. df["col3"]
- Series.describe() method
- DataFrame.columns attribute

In [None]:
_df["col3"]
_df["col3"].describe()
_df.columns

### Task 1: Data Validation & Sanity Checking

In [None]:
df.columns
gpa = df["gpa"]
gpa.describe()

In [None]:
grades = df["pctgrades"]
grades.describe()

### Takeaways from task 1
- Don't use GPA or percentage average in your analysis: there isn't enough data on people with an average below 80
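
One way to back up this takeaway is to count how many rows report a grade at all and how many of those fall below 80. The sketch below uses made-up values standing in for the `pctgrades` column; the numbers are illustrative assumptions, not workshop data.

```python
import pandas as pd

# Hypothetical percentage averages; None marks resumes that list no grade.
pctgrades = pd.Series([92.0, 88.5, None, 85.0, None, 79.0, 90.0, None])

reported = pctgrades.notnull().sum()         # number of rows with a grade
share_reported = pctgrades.notnull().mean()  # fraction of rows with a grade
below_80 = (pctgrades < 80).sum()            # missing values compare as False

print(f"{reported} of {len(pctgrades)} rows report a grade")
print(f"{below_80} reported grades are below 80")
```

If the count below 80 is tiny compared with the size of the dataset, any conclusion about that group rests on too few resumes.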

### Demo 2: Series Operations & Intro to Matplotlib

Students should learn to
- Plot scatter plots with Matplotlib
- Add axis labels to a plot

In [None]:
plt.scatter(_df["col4"], _df["col5"])
plt.show()
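
The learning goals for this demo include axis labels, so a labelled version of the same scatter plot may be useful. This is a sketch that rebuilds `col4` and `col5` in a standalone DataFrame; the `Agg` backend line and the output file name are assumptions for running it outside a notebook, where `plt.show()` would be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Same values as col4 and col5 in the demo DataFrame.
demo = pd.DataFrame({"col4": [1, None, 3, 4, 5, 6],
                     "col5": [1, 2, 4, None, 16, 32]})

plt.scatter(demo["col4"], demo["col5"])  # points with a missing coordinate are not drawn
plt.xlabel("col4")
plt.ylabel("col5")
plt.savefig("demo2_scatter.png")  # hypothetical file name
```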

### Task 2: More Data Validation via Plotting

This task has been cut due to time constraints 

In [None]:
df["numcoops"].describe()

In [None]:
has_start_year = df[~df["startyear"].isnull()]

X = has_start_year["startyear"]
Y = has_start_year["numcoops"]
plt.scatter( X, Y)
# Linear regression is too hard, don't bother with trying to explain it or demo it
# plt.plot(X, LinearRegression().fit(X.values.reshape(-1, 1), Y).predict(X.values.reshape(-1, 1)))
plt.xlabel("Start Year")
plt.ylabel("# of Internships")

plt.show()

### Takeaways From Task 2:

- Generally, the earlier you started university, the more internships you have had. This is in line with our expectations, so the `numcoops` and `startyear` fields are probably okay to use in our analysis
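
The same sanity check can be expressed numerically. The sketch below uses a small made-up table (not the resume dataset) and two pandas calls, `Series.corr` and `groupby(...).mean()`, to confirm that later start years go with fewer co-ops.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "startyear": [2014, 2015, 2015, 2016, 2017, 2018],
    "numcoops":  [6, 5, 4, 4, 2, 1],
})

# Pearson correlation; a negative value means later start years go with fewer co-ops.
corr = toy["startyear"].corr(toy["numcoops"])

# Mean number of co-ops per start year, a table version of the scatter plot.
mean_by_year = toy.groupby("startyear")["numcoops"].mean()

print(round(corr, 2))
print(mean_by_year)
```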


### Demo 3 - Concatenating Series & Counting
Students should learn how to
- Perform element-wise addition on Series
- Append two Series together
- Count how often each value occurs in a Series with .value_counts()

In [None]:
_df["col4"]
_df["col5"]
_df["col4"] + _df["col5"]
pd.concat([_df["col4"], _df["col5"]])  # Series.append was removed in pandas 2.0
_df["col3"].value_counts()
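
A minimal sketch of the three operations on standalone Series, showing how missing values behave in each. It joins the Series with `pd.concat`, since `Series.append` is no longer available in pandas 2.0 and later; the variable names are illustrative.

```python
import pandas as pd

col4 = pd.Series([1, None, 3, 4, 5, 6])
col5 = pd.Series([1, 2, 4, None, 16, 32])

total = col4 + col5                # NaN wherever either input is missing
stacked = pd.concat([col4, col5])  # 12 values, original index labels repeated
relabelled = pd.concat([col4, col5], ignore_index=True)  # index 0..11

counts = stacked.value_counts()    # missing values are excluded by default
counts_with_nan = stacked.value_counts(dropna=False)

print(total)
print(counts)
```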

### Task 3 - Which companies hire the most UW students?

In [None]:
# Stack all six company columns into one Series (Series.append was removed in pandas 2.0)
companies = pd.concat([df[f"company{i}"] for i in range(1, 7)])
companies.value_counts().head(10)

### Takeaways from task 3:
If you are applying on an off-term, these are the companies you probably want to apply to... 

### Demo 4 - Subset Selection 

Students should learn how to
- Select subsets of a dataframe
- Perform operations on subsets of dataframes (and how they still behave like dataframes) 

In [None]:
_df["col1"].isnull()
~_df["col1"].isnull()
_df["col3"] == 2
(_df["col3"] == 2) & _df["col2"].isnull()  # parentheses needed: & binds tighter than ==

_df[_df["col3"] == 2]
_df[_df["col3"] == 2]["col4"]
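
Combining conditions is where subset selection most often goes wrong, so a small sketch may help. It names each boolean mask before combining them, which sidesteps the operator-precedence issue between `&` and `==`. The DataFrame mirrors three columns of the demo data; the mask names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"col2": [1, 1, None, 1, 1, None],
                     "col3": [2, 1, 2, 1, 2, 1],
                     "col4": [1, None, 3, 4, 5, 6]})

is_two = demo["col3"] == 2
col2_missing = demo["col2"].isnull()

both = demo[is_two & col2_missing]        # rows where both conditions hold
either = demo[is_two | col2_missing]      # rows where at least one holds
neither = demo[~(is_two | col2_missing)]  # rows where neither holds

# A subset is still a DataFrame, so the usual methods keep working.
mean_col4 = both["col4"].mean()
print(len(both), len(either), len(neither), mean_col4)
```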

### Task 4: Which places hire a lot of first years?


In [None]:
one_coops = df[df["numcoops"] == 1]["company1"]
two_coops = df[df["numcoops"] == 2]["company2"]
three_coops = df[df["numcoops"] == 3]["company3"]
four_coops = df[df["numcoops"] == 4]["company4"]
five_coops = df[df["numcoops"] == 5]["company5"]
six_coops = df[df["numcoops"] == 6]["company6"]

In [None]:
# Series.append was removed in pandas 2.0, so combine with pd.concat
pd.concat([one_coops, two_coops, three_coops, four_coops, five_coops, six_coops]).value_counts()

### Takeaway from task 4: 
- You probably won't land a job in California for your first co-op
- Shopify is pretty cool, apply there

### Task 5:  What places hire a lot of interns who return?

This task has been cut due to time 

In [None]:
return1 = df[df["company1"] == df["company2"]]["company1"]
return2 = df[df["company2"] == df["company3"]]["company2"]
return3 = df[df["company3"] == df["company4"]]["company3"]
return4 = df[df["company4"] == df["company5"]]["company4"]
return5 = df[df["company5"] == df["company6"]]["company5"]
# Series.append was removed in pandas 2.0, so combine with pd.concat
pd.concat([return1, return2, return3, return4, return5]).value_counts()

### Demo 5 - Summing in Pandas 
Students should learn:

- How to sum in Pandas
- How to sort by values in a series
- Chaining & Optional parameters & the importance of reading documentation

In [None]:
# preprocess datasets - this command will not be shown
skills = df["skills"].fillna("")
# One row per resume, one column per skill (top-level pd.value_counts is deprecated)
ranked_skills_list = skills.apply(lambda x: pd.Series(x.split(",")).value_counts())
ranked_skills_list.drop("", axis=1, inplace=True)
ranked_skills_list.to_csv("skills_data.csv", index=False)



In [None]:
ranked_skills_list = pd.read_csv("skills_data.csv")

skills_summed = ranked_skills_list.sum(axis = 1)
skills_summed = ranked_skills_list.sum(axis = 0)
skills_summed = ranked_skills_list.sum(axis = 0).sort_values(ascending=False)
skills_summed.head(20)
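
The difference between `axis=0` and `axis=1` is easier to see on a tiny table. The sketch below uses a hypothetical skills table (one row per resume, one column per skill) in the same shape as the preprocessed data; the values are made up.

```python
import pandas as pd

# Hypothetical table: a count where the resume lists the skill, missing otherwise.
skills_table = pd.DataFrame({
    "python": [1, 1, None, 1],
    "java":   [None, 1, 1, None],
    "sql":    [1, None, None, None],
})

per_resume = skills_table.sum(axis=1)  # across columns: skills listed on each resume
per_skill = skills_table.sum(axis=0)   # down rows: resumes listing each skill
ranked = per_skill.sort_values(ascending=False)  # optional parameter flips the order

print(per_resume)
print(ranked)
```

Missing values are skipped by `sum` by default, which is why the table does not need to be filled with zeros first.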


### Task 6 - What skills do a lot of people use on their jobs?

In [None]:
# PreProcess - Will not be demo'd  

roles_list = pd.Series(dtype=object)
for idx, r in df.iterrows():
    # Concatenate the role descriptions of every co-op the student has held
    roles = str(r["roles1"])
    if r["numcoops"] >= 2:
        roles = roles + " " + str(r["roles2"])
    if r["numcoops"] >= 3:
        roles = roles + " " + str(r["roles3"])
    if r["numcoops"] >= 4:
        roles = roles + " " + str(r["roles4"])
    if r["numcoops"] >= 5:
        roles = roles + " " + str(r["roles5"])
    if r["numcoops"] >= 6:
        roles = roles + " " + str(r["roles6"])

    # Keep only the words that are also skill names (columns of ranked_skills_list)
    words = roles.split(" ")
    sk = " "
    for w in words:
        w = w.strip()
        if w in ranked_skills_list and w != "":
            sk = sk + "," + w
    roles_list.loc[idx] = sk[1:]  # Series.set_value was removed in pandas 1.0
ranked_roles_list = roles_list.apply(lambda x: pd.Series(x.split(",")).value_counts())
ranked_roles_list.drop("", axis=1, inplace=True)
ranked_roles_list.to_csv("roles_data.csv", index=False)

In [None]:
ranked_roles_list = pd.read_csv("roles_data.csv")

duties_summed = ranked_roles_list.sum(axis = 0).sort_values(ascending=False)
# Maybe a simpler version?
#ranked_skills_list = pd.Series(skills.str.cat(sep=",").split(",")).value_counts()
duties_summed.head(20)
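
A possible simpler route to the same kind of ranking is to split each comma-separated string and count the pieces directly, without building a wide table first. This is a sketch on a made-up `skills` Series in the dataset's comma-separated format, using `Series.str.split` and `Series.explode`; it is an alternative for illustration, not the method used above.

```python
import pandas as pd

# Hypothetical skills column; None stands for a resume with no skills listed.
skills = pd.Series(["python,java", "python,sql", None, "java,python,c++"]).fillna("")

exploded = skills.str.split(",").explode()  # one row per listed skill
skill_counts = exploded[exploded != ""].value_counts()  # drop blanks, then count

print(skill_counts.head(20))
```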

### Takeaways from Task 6

- iOS and Android dev is useful for many jobs, but many people do not have it as a skill. Mobile dev may be a good niche to get into if you want the least competition when applying for jobs ;)
- For those interested in data, Spark may be a good framework to learn
- C++ is known by many people, but not many jobs are using C++. The opposite is true for Golang
- Python is quite a popular language to learn and also use on the job
- Docker doesn't seem that popular in terms of co-op jobs (may imply DevOps jobs are rare). Validate this hypothesis?