## 1. Introduction

### What is a Data Scientist?

(1) Mathematics and Statistics Skills
allow them to identify interesting insights in a sea of data.

(2) Programming Skills
can code up statistical models and get data from a variety of data sources

(3) Substantive Expertise
knows how to ask the right questions and translate those questions into a sound
analysis

(4) Communication Skills
can report their findings in a way that people can easily understand. In other words, data

IN A NUTSHELL...
Data scientists can perform complicated analyses on huge data sets. Having done so, they can also write up their findings and make informative graphs to communicate them to others.

### What does a Data Scientist Do?

What does it mean for a data scientist to have 'substantive expertise' and why is it important?

- Knows which questions to ask
- Can interpret the data well
- Understands structure of the data
- Data Scientists often work in teams

#### Simpson's Paradox

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('./Berkeley_Graduate_1973Admission.csv')
# Admission data for six largest departments

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 4 columns):
Admit     24 non-null object
Gender    24 non-null object
Dept      24 non-null object
Freq      24 non-null int64
dtypes: int64(1), object(3)
memory usage: 848.0+ bytes


In [6]:
df.head()

,Admit,Gender,Dept,Freq
0,Admitted,Male,A,512
1,Rejected,Male,A,313
2,Admitted,Female,A,89
3,Rejected,Female,A,19
4,Admitted,Male,B,353


In [12]:
# Sum only 'Freq'; the text column 'Dept' should not be aggregated
df_admit_gender = \
df.groupby(['Admit', 'Gender'], as_index=False)['Freq'].sum()

In [13]:
df_admit_gender

,Admit,Gender,Freq
0,Admitted,Female,557
1,Admitted,Male,1198
2,Rejected,Female,1278
3,Rejected,Male,1493


In [32]:
# Overall Acceptance rate by gender
print("Male acceptance rate : ", 1198 * 100 / (1198 + 1493), '%')
print("Female acceptance rate: ", 557 * 100/ (557 + 1278), '%')

Male acceptance rate :  44.518766257896694 %
Female acceptance rate:  30.354223433242506 %
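
The rates above are typed in by hand from the grouped table. As a sketch, the same figures can be derived from the grouped counts directly. The four counts are re-entered below so the snippet runs on its own, and the names `counts` and `acceptance_rate` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Grouped counts copied from the table above (Admit x Gender).
df_admit_gender = pd.DataFrame({
    'Admit':  ['Admitted', 'Admitted', 'Rejected', 'Rejected'],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Freq':   [557, 1198, 1278, 1493],
})

# One row per gender, one column per admission outcome.
counts = df_admit_gender.pivot(index='Gender', columns='Admit', values='Freq')

# Acceptance rate (%) = admitted / (admitted + rejected).
acceptance_rate = counts['Admitted'] * 100 / counts.sum(axis=1)
print(acceptance_rate.round(2))
```

Computing the rates from the table avoids copy errors if the underlying data changes.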


In [87]:
# But if we look at each department...
# Sum only 'Freq' so the result has exactly three columns: Dept, Gender, Freq
df_dept = df.groupby(['Dept', 'Gender'], as_index=False)['Freq'].sum()

df_dept.head()

,Dept,Gender,Freq
0,A,Female,108
1,A,Male,825
2,B,Female,25
3,B,Male,560
4,C,Female,593


In [93]:
df_dept.columns =\
['Dept', 'Gender', 'Total Applicants']

In [95]:
df_dept.head()

,Dept,Gender,Total Applicants
0,A,Female,108
1,A,Male,825
2,B,Female,25
3,B,Male,560
4,C,Female,593


In [96]:
df.head()

,Admit,Gender,Dept,Freq
0,Admitted,Male,A,512
1,Rejected,Male,A,313
2,Admitted,Female,A,89
3,Rejected,Female,A,19
4,Admitted,Male,B,353


In [99]:
df_dept_admit_by_gender = \
df_dept.merge(df, on=['Dept', 'Gender'], how='left')

In [100]:
# keep only the 'Admitted' rows (drop rejected applicants) and index by Dept, Gender

df_dept_admit_by_gender =\
df_dept_admit_by_gender[df_dept_admit_by_gender['Admit'] == 'Admitted']\
.set_index(['Dept', 'Gender'])

In [107]:
# drop 'Admit' column
df_dept_admit_by_gender = \
df_dept_admit_by_gender.drop(labels= 'Admit', axis=1)

In [109]:
df_dept_admit_by_gender['Admission Rate'] = \
df_dept_admit_by_gender['Freq'] * 100 / df_dept_admit_by_gender['Total Applicants']

In [117]:
df_dept_admit_by_gender.stack().unstack(1).unstack()

Dept,Female Total Applicants,Female Freq,Female Admission Rate,Male Total Applicants,Male Freq,Male Admission Rate
A,108.0,89.0,82.407407,825.0,512.0,62.060606
B,25.0,17.0,68.0,560.0,353.0,63.035714
C,593.0,202.0,34.064081,325.0,120.0,36.923077
D,375.0,131.0,34.933333,417.0,138.0,33.093525
E,393.0,94.0,23.918575,191.0,53.0,27.748691
F,341.0,24.0,7.038123,373.0,22.0,5.898123


We can see from the results above that the female admission rate was higher in departments A, B, D and F (4 departments out of 6), even though the overall female admission rate was lower.

##### This is called "Simpson's paradox"

Simpson's paradox, or the Yule–Simpson effect, is a paradox in probability and statistics, in which a trend appears in different groups of data but disappears or reverses when these groups are combined.
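
One way to see why the reversal can happen: the pooled rate for a group is a weighted average of its per-subgroup rates. The notation below is introduced here for illustration only: $n_{g,d}$ is the number of applicants of gender $g$ to department $d$, and $a_{g,d}$ is the number of those admitted.

```latex
\[
r_g \;=\; \frac{\sum_d a_{g,d}}{\sum_d n_{g,d}}
     \;=\; \sum_d
       \underbrace{\frac{n_{g,d}}{\sum_{d'} n_{g,d'}}}_{w_{g,d}}
       \cdot
       \underbrace{\frac{a_{g,d}}{n_{g,d}}}_{r_{g,d}}
\]
```

Because the weights $w_{g,d}$ differ between groups, one group can have the higher rate $r_{g,d}$ in most departments and still have the lower pooled rate $r_g$ if its applicants are concentrated in departments with low admission rates.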

The admission figures for the fall of 1973 showed that men applying were more likely than women to be admitted. But when examining the individual departments, it appeared that six out of 85 departments were significantly biased against men, whereas only four were significantly biased against women. In fact, the pooled and corrected data showed a "small but statistically significant bias in favor of women."
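
The reversal can be reproduced from the six-department counts shown earlier. The following is a minimal sketch, with the counts re-entered by hand so that it runs on its own; the names `data`, `by_dept`, `female_higher`, and `pooled` are illustrative.

```python
import pandas as pd

# Applicant and admission counts copied from the per-department table above.
data = pd.DataFrame({
    'Dept':       ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],
    'Gender':     ['Female', 'Male'] * 6,
    'Applicants': [108, 825, 25, 560, 593, 325, 375, 417, 393, 191, 341, 373],
    'Admitted':   [89, 512, 17, 353, 202, 120, 131, 138, 94, 53, 24, 22],
})

# Per-department admission rate (%), one column per gender.
data['Rate'] = data['Admitted'] * 100 / data['Applicants']
by_dept = data.pivot(index='Dept', columns='Gender', values='Rate')

# Departments where the female rate exceeds the male rate.
female_higher = by_dept.index[by_dept['Female'] > by_dept['Male']].tolist()

# Pooled rate (%): sum the counts across departments first, then divide.
totals = data.groupby('Gender')[['Applicants', 'Admitted']].sum()
pooled = totals['Admitted'] * 100 / totals['Applicants']

print(by_dept.round(1))
print('Female rate higher in:', female_higher)
print(pooled.round(1))
```

In this data, women applied mostly to departments C through F, where admission rates were low for everyone, which pulls their pooled rate down.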

#### How can we solve real world problems with data science?

Data science can solve problems you'd expect...

- Netflix : Collaborative filtering algorithms based on what the users have previously watched
- Social Media : Recommending new connections on LinkedIn, constructing the Facebook news feed, suggesting new people to follow
- Web Apps : OkCupid, Uber etc.

And a ton more you might not expect...

- Bioinformatics : Annotating genomes, Analyzing data sequences
- Urban Planning : Resolving Chicago's bus crowding issue using data
- Astrophysics
- Public Health : Analyzing electronic medical records; targeting efforts towards the specific buildings that account for a majority of emergency admissions
- Sports : SportVU player-tracking cameras in the NBA record huge amounts of data on players' movement and playing styles, leading to better coaching decisions and better analysis of game trends

### Numpy & Pandas

In [121]:
countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']
gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]
olympic_medal_counts = {'country_name': pd.Series(countries),
                            'gold': pd.Series(gold),
                            'silver': pd.Series(silver),
                            'bronze': pd.Series(bronze)}
df_olymp_medal = pd.DataFrame(olympic_medal_counts)

In [122]:
df_olymp_medal.head()

,bronze,country_name,gold,silver
0,9,Russian Fed.,13,11
1,10,Norway,11,5
2,5,Canada,10,10
3,12,United States,9,7
4,9,Netherlands,8,7


In [125]:
avg_bronze_at_least_one_gold = \
df_olymp_medal[df_olymp_medal.gold >= 1].bronze.mean()

avg_bronze_at_least_one_gold

4.238095238095238
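
The one-liner above relies on boolean indexing. As a sketch, the same steps can be spelled out on four rows taken from the medal table; the names `sample`, `has_gold`, and `gold_winners` are illustrative.

```python
import pandas as pd

# Four rows taken from the medal table above.
sample = pd.DataFrame({
    'country_name': ['Russian Fed.', 'Norway', 'Italy', 'Latvia'],
    'gold':   [13, 11, 0, 0],
    'bronze': [9, 10, 6, 2],
})

# Step 1: a boolean Series, True where the country has at least one gold.
has_gold = sample['gold'] >= 1

# Step 2: indexing with the mask keeps only the rows marked True.
gold_winners = sample[has_gold]

# Step 3: average the bronze column of the remaining rows.
avg_bronze = gold_winners['bronze'].mean()
print(avg_bronze)  # 9.5
```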

Q.) Using the DataFrame's apply method, create a new Series called avg_medal_count that indicates the average number of gold, silver, and bronze medals earned amongst countries who earned at least one medal of any kind at the 2014 Sochi Olympics. Note that the countries list already only includes countries that have earned at least one medal. No additional filtering is necessary.

In [134]:
# Select the three medal columns with a list of names, then average each column
df_olymp_medal[['gold', 'silver', 'bronze']].apply(np.mean)

gold      3.807692
silver    3.730769
bronze    3.807692
dtype: float64

Q.) Imagine a point system in which each country is awarded 4 points for each gold medal,  2 points for each silver medal, and 1 point for each bronze medal. Using the numpy.dot function, create a new dataframe called 'olympic_points_df' that includes:       

a) a column called 'country_name' with the country name

b) a column called 'points' with the total number of points the country earned at the Sochi Olympics.

In [136]:
olympic_points_df = df_olymp_medal.copy()

In [137]:
olympic_points_df.head()

,bronze,country_name,gold,silver
0,9,Russian Fed.,13,11
1,10,Norway,11,5
2,5,Canada,10,10
3,12,United States,9,7
4,9,Netherlands,8,7


In [138]:
# Refer to each medal column by name rather than by position within the row
olympic_points_df['points'] =\
olympic_points_df[['bronze','gold','silver']].\
apply(lambda x: x['gold']*4 + x['silver']*2 + x['bronze'], axis=1)

Another way, using numpy.dot (the weights must be in the same order as the selected columns):

olympic_points_df['points'] = pd.Series(np.dot(olympic_points_df[['gold','silver','bronze']], [4, 2, 1]))
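
numpy.dot of an (n × 3) matrix of medal counts with a length-3 weight vector computes one weighted sum per row, so the order of the columns has to match the order of the weights. The following is a small sketch on the first three countries from the table, building the two-column frame the question asks for; the names `medals`, `weights`, and `points` are illustrative.

```python
import numpy as np
import pandas as pd

# First three countries from the medal table above.
medals = pd.DataFrame({
    'country_name': ['Russian Fed.', 'Norway', 'Canada'],
    'gold':   [13, 11, 10],
    'silver': [11, 5, 10],
    'bronze': [9, 10, 5],
})

# Weights listed in the same order as the selected columns: gold, silver, bronze.
weights = [4, 2, 1]

# (3 x 3) matrix times length-3 vector gives one point total per country.
points = np.dot(medals[['gold', 'silver', 'bronze']].to_numpy(), weights)

olympic_points_df = pd.DataFrame({
    'country_name': medals['country_name'],
    'points': points,
})
print(olympic_points_df)
```

For Russia this gives 13·4 + 11·2 + 9·1 = 83, matching the row-wise apply version.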

In [139]:
olympic_points_df

,bronze,country_name,gold,silver,points
0,9,Russian Fed.,13,11,83
1,10,Norway,11,5,64
2,5,Canada,10,10,65
3,12,United States,9,7,62
4,9,Netherlands,8,7,55
5,5,Germany,8,6,49
6,2,Switzerland,6,3,32
7,1,Belarus,5,0,21
8,5,Austria,4,8,37
9,7,France,4,4,31
