# Beginning Data Analysis

## Introduction
This chapter begins with the tasks you should perform when examining a new dataset for the first time.

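## Developing a data analysis routine
There is no standard approach to data analysis, but it is generally a good idea to develop a routine of steps to run whenever you first examine a dataset. Doing so helps you become familiar with a new dataset more quickly.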

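### Getting ready
**Exploratory Data Analysis (EDA)** is an umbrella term for the whole process of analyzing data. This recipe covers a small but fundamental part of EDA: collecting **metadata** and univariate descriptive statistics in a routine, systematic way. It shows the common tasks to perform when a dataset is first imported as a DataFrame.
> Metadata is data that describes the dataset. Examples include the number of columns and rows, the column names, the data type of each column, the source of the dataset, the date of collection, and the acceptable values for each column.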

> Univariate descriptive statistics are summary statistics about individual variables (columns) of the dataset, computed independently of all other variables.
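
The two definitions above can be made concrete with a small sketch. The toy DataFrame and the `metadata` dictionary below are hypothetical and exist only for illustration; they are not part of the college dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for a freshly imported dataset.
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'state': ['AL', 'AL', 'CA', None],
    'score': [1.0, 2.0, None, 4.0],
})

# Metadata: facts about the dataset itself.
metadata = {
    'n_rows': df.shape[0],
    'n_columns': df.shape[1],
    'column_names': list(df.columns),
    'dtypes': df.dtypes.astype(str).to_dict(),
    'missing_per_column': df.isna().sum().to_dict(),
}

# Univariate descriptive statistics: each numeric column is summarized
# on its own, without reference to any other column.
numeric_summary = df.describe(include=['number']).T
```

Collecting these values in one place is a design choice for the sketch; the recipe itself gathers the same information with separate calls.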

### How to do it
First, collect metadata on the college dataset and compute basic summary statistics for each column.

In [2]:
import pandas as pd
import numpy as np

In [3]:
college = pd.read_csv('data/college.csv')
college.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [5]:
college.shape

(7535, 27)

In [6]:
college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
INSTNM                7535 non-null object
CITY                  7535 non-null object
STABBR                7535 non-null object
HBCU                  7164 non-null float64
MENONLY               7164 non-null float64
WOMENONLY             7164 non-null float64
RELAFFIL              7535 non-null int64
SATVRMID              1185 non-null float64
SATMTMID              1196 non-null float64
DISTANCEONLY          7164 non-null float64
UGDS                  6874 non-null float64
UGDS_WHITE            6874 non-null float64
UGDS_BLACK            6874 non-null float64
UGDS_HISP             6874 non-null float64
UGDS_ASIAN            6874 non-null float64
UGDS_AIAN             6874 non-null float64
UGDS_NHPI             6874 non-null float64
UGDS_2MOR             6874 non-null float64
UGDS_NRA              6874 non-null float64
UGDS_UNKN             6874 non-null float64
PPTUG_EF          

In [7]:
college.describe(include=[np.number]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,475.0,510.0,555.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,482.0,520.0,565.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.2675,0.5557,0.747875,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.036125,0.10005,0.2577,1.0


In [10]:
college.describe(include=[object, 'category']).T

Unnamed: 0,count,unique,top,freq
INSTNM,7535,7535,Texas Woman's University,1
CITY,7535,2514,New York,87
STABBR,7535,59,CA,773
MD_EARN_WNE_P10,6413,598,PrivacySuppressed,822
GRAD_DEBT_MDN_SUPP,7503,2038,PrivacySuppressed,1510


In [12]:
college.describe(include=[np.number], percentiles=[.01, .05, .10]).T

Unnamed: 0,count,mean,std,min,1%,5%,10%,50%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,0.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,390.0,430.0,447.4,510.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,395.0,430.0,453.0,520.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,14.0,31.65,49.0,412.5,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.0,0.013265,0.06879,0.5557,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.0,0.0,0.00753,0.10005,1.0
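
As a rough check on what the `percentiles` argument reports, the sketch below compares `describe` with a direct call to `quantile` on a hypothetical toy Series (the values are illustrative only). It assumes the percentile rows are labeled `'1%'`, `'5%'`, and `'10%'`, as in the output above.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
s = pd.Series(range(1, 101), name='x', dtype='float64')

# Percentiles requested through describe ...
summary = s.describe(percentiles=[.01, .05, .10])

# ... and the same quantiles computed directly.
manual = s.quantile([.01, .05, .10])
```

Under that assumption, each percentile row in `summary` should match the corresponding entry in `manual`.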


In [14]:
pd.read_csv('data/college_data_dictionary.csv')

Unnamed: 0,column_name,description
0,INSTNM,Institution Name
1,CITY,City Location
2,STABBR,State Abbreviation
3,HBCU,Historically Black College or University
4,MENONLY,0/1 Men Only
5,WOMENONLY,0/1 Women only
6,RELAFFIL,0/1 Religious Affiliation
7,SATVRMID,SAT Verbal Median
8,SATMTMID,SAT Math Median
9,DISTANCEONLY,Distance Education Only


In [15]:
different_cols = ['RELAFFIL', 'SATMTMID', 'CURROPER', 'INSTNM', 'STABBR']
col2 = college.loc[:, different_cols]
col2.head()

Unnamed: 0,RELAFFIL,SATMTMID,CURROPER,INSTNM,STABBR
0,0,420.0,1,Alabama A & M University,AL
1,0,565.0,1,University of Alabama at Birmingham,AL
2,1,,1,Amridge University,AL
3,0,590.0,1,University of Alabama in Huntsville,AL
4,0,430.0,1,Alabama State University,AL


In [16]:
col2.dtypes

RELAFFIL      int64
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

In [18]:
original_mem = col2.memory_usage(deep=True)
original_mem

Index           80
RELAFFIL     60280
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

In [19]:
col2['RELAFFIL'] = col2['RELAFFIL'].astype(np.int8)

In [21]:
col2.dtypes

RELAFFIL       int8
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

In [22]:
col2.memory_usage(deep=True)

Index           80
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

In [23]:
col2.select_dtypes(include=['object']).nunique()

INSTNM    7535
STABBR      59
dtype: int64

In [24]:
col2['STABBR'] = col2['STABBR'].astype('category')
col2.dtypes

RELAFFIL        int8
SATMTMID     float64
CURROPER       int64
INSTNM        object
STABBR      category
dtype: object

In [25]:
col2.memory_usage(deep=True)

Index           80
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660699
STABBR       13576
dtype: int64

In [28]:
new_mem = col2.memory_usage(deep=True)

In [29]:
new_mem / original_mem

Index       1.000000
RELAFFIL    0.125000
SATMTMID    1.000000
CURROPER    1.000000
INSTNM      1.000695
STABBR      0.030538
dtype: float64
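
The ratios above follow from the storage sizes involved: an `int64` value takes eight bytes and an `int8` value takes one, and a categorical column stores each distinct string once plus a small integer code per row. A minimal sketch on hypothetical toy data (the column names `flag` and `state` are made up for illustration) shows the same pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: a 0/1 flag and a low-cardinality string column.
n = 1000
df = pd.DataFrame({
    'flag': np.tile([0, 1], n // 2).astype('int64'),
    'state': np.tile(['AL', 'CA', 'NY', 'TX'], n // 4),
})

before = df.memory_usage(deep=True)

# A 0/1 column fits in one byte per value.
df['flag'] = df['flag'].astype(np.int8)
# Few unique values relative to the row count make 'category' worthwhile.
df['state'] = df['state'].astype('category')

after = df.memory_usage(deep=True)
ratio = after / before
```

Converting a column whose values are nearly all unique, such as `INSTNM`, is unlikely to help, because every distinct string still has to be stored.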

In [30]:
movie = pd.read_csv('data/movie.csv')
movie2 = movie[['movie_title', 'imdb_score', 'budget']]
movie2.head()

Unnamed: 0,movie_title,imdb_score,budget
0,Avatar,7.9,237000000.0
1,Pirates of the Caribbean: At World's End,7.1,300000000.0
2,Spectre,6.8,245000000.0
3,The Dark Knight Rises,8.5,250000000.0
4,Star Wars: Episode VII - The Force Awakens,7.1,


In [31]:
movie2.nlargest(100, 'imdb_score').head()

Unnamed: 0,movie_title,imdb_score,budget
2725,Towering Inferno,9.5,
1920,The Shawshank Redemption,9.3,25000000.0
3402,The Godfather,9.2,6000000.0
2779,Dekalog,9.1,
4312,Kickboxer: Vengeance,9.1,17000000.0


In [32]:
movie2.nlargest(100, 'imdb_score').nsmallest(5, 'budget')

Unnamed: 0,movie_title,imdb_score,budget
4804,Butterfly Girl,8.7,180000.0
4801,Children of Heaven,8.5,180000.0
4706,12 Angry Men,8.9,350000.0
4550,A Separation,8.4,500000.0
4636,The Other Dream Team,8.4,500000.0


In [37]:
movie = pd.read_csv('data/movie.csv')
movie2 = movie[['movie_title', 'title_year', 'imdb_score']]

In [38]:
movie2.sort_values('title_year', ascending=False).head()

Unnamed: 0,movie_title,title_year,imdb_score
3884,The Veil,2016.0,4.7
2375,My Big Fat Greek Wedding 2,2016.0,6.1
2794,Miracles from Heaven,2016.0,6.8
92,Independence Day: Resurgence,2016.0,5.5
153,Kung Fu Panda 3,2016.0,7.2


In [39]:
movie3 = movie2.sort_values(['title_year', 'imdb_score'], ascending=False)
movie3.head()

Unnamed: 0,movie_title,title_year,imdb_score
4312,Kickboxer: Vengeance,2016.0,9.1
4277,A Beginner's Guide to Snuff,2016.0,8.7
3798,Airlift,2016.0,8.5
27,Captain America: Civil War,2016.0,8.2
98,Godzilla Resurgence,2016.0,8.2


In [40]:
movie_top_year = movie3.drop_duplicates(subset='title_year')
movie_top_year.head()

Unnamed: 0,movie_title,title_year,imdb_score
4312,Kickboxer: Vengeance,2016.0,9.1
3745,Running Forever,2015.0,8.6
4369,Queen of the Mountains,2014.0,8.7
3935,"Batman: The Dark Knight Returns, Part 2",2013.0,8.4
3,The Dark Knight Rises,2012.0,8.5
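
Sorting and then dropping duplicates keeps the first, and therefore highest-scoring, row for each year. One possible alternative, sketched here on hypothetical toy data with made-up titles and scores, is to group by year and look up the row with the maximum score. When scores are tied within a year, the two approaches may pick different rows.

```python
import pandas as pd

# Hypothetical toy data; titles, years, and scores are made up.
films = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'year': [2015, 2016, 2015, 2016, 2014],
    'score': [7.1, 6.3, 8.2, 9.0, 5.5],
})

# Approach from the cells above: sort, then keep the first row per year.
by_sort = (films.sort_values(['year', 'score'], ascending=False)
                .drop_duplicates(subset='year'))

# Alternative: find the row label of the maximum score within each year.
by_group = films.loc[films.groupby('year')['score'].idxmax()]
```

Both results contain the same rows here; only their order differs, since `groupby` returns the years in ascending order.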


In [42]:
movie = pd.read_csv('data/movie.csv')
movie2 = movie[['movie_title', 'imdb_score', 'budget']] 
movie_smallest_largest = movie2.nlargest(100, 'imdb_score').nsmallest(5, 'budget')
movie_smallest_largest

Unnamed: 0,movie_title,imdb_score,budget
4804,Butterfly Girl,8.7,180000.0
4801,Children of Heaven,8.5,180000.0
4706,12 Angry Men,8.9,350000.0
4550,A Separation,8.4,500000.0
4636,The Other Dream Team,8.4,500000.0


In [43]:
movie2.sort_values('imdb_score', ascending=False).head()

Unnamed: 0,movie_title,imdb_score,budget
2725,Towering Inferno,9.5,
1920,The Shawshank Redemption,9.3,25000000.0
3402,The Godfather,9.2,6000000.0
2779,Dekalog,9.1,
4312,Kickboxer: Vengeance,9.1,17000000.0
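
The output above matches the first rows returned by `nlargest`, which suggests that the `nlargest`/`nsmallest` chain can be reproduced with `sort_values` followed by `head`. A minimal sketch on hypothetical toy data (titles, scores, and budgets are made up) compares the two. It assumes no ties or missing values in the sort columns, since the two methods may treat those cases differently.

```python
import pandas as pd

# Hypothetical toy data; all values are made up and free of ties.
films = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'score': [7.1, 6.3, 8.2, 9.0, 5.5, 7.8],
    'budget': [50, 10, 30, 80, 5, 20],
})

# Top 4 by score, then the 2 cheapest of those.
via_nlargest = films.nlargest(4, 'score').nsmallest(2, 'budget')

# The same selection expressed with sort_values and head.
via_sort = (films.sort_values('score', ascending=False).head(4)
                 .sort_values('budget').head(2))
```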


In [50]:
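import datetime

# Compatibility shim: older pandas_datareader releases import is_list_like
# from pandas.core.common, so patch it in before importing pandas_datareader.
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data, wb
import fix_yahoo_finance as yf  # this package was later renamed yfinance

# Route pandas_datareader's Yahoo reader through fix_yahoo_finance.
yf.pdr_override()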

In [57]:
start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2017, 12, 31)
tsla = data.get_data_yahoo('tsla', start, end)
tsla



Date,Open,High,Low,Close,Adj Close,Volume
2017-01-03,214.860001,220.330002,210.960007,216.990005,216.990005,5923300
2017-01-04,214.750000,228.000000,214.309998,226.990005,226.990005,11213500
2017-01-05,226.419998,227.479996,221.949997,226.750000,226.750000,5911700
2017-01-06,226.929993,230.309998,225.449997,229.009995,229.009995,5527900
2017-01-09,228.970001,231.919998,228.000000,231.279999,231.279999,3957000
2017-01-10,232.000000,232.000000,226.889999,229.869995,229.869995,3660000
2017-01-11,229.070007,229.979996,226.679993,229.729996,229.729996,3650800
2017-01-12,229.059998,230.699997,225.580002,229.589996,229.589996,3790200
2017-01-13,230.000000,237.850006,229.589996,237.750000,237.750000,6093000
2017-01-17,236.699997,239.960007,234.369995,235.580002,235.580002,4611900
