# Using Pandas to Get Familiar With Your Data

The first step in any machine learning project is to familiarize yourself with the data.

You'll use the Pandas library for this.

Pandas is the primary tool data scientists use for exploring and manipulating data.

Most people abbreviate pandas in their code as `pd`. We do this with the command:

In [1]:
import pandas as pd

The most important part of the Pandas library is the DataFrame.

A DataFrame holds the type of data you might think of as a table.

This is similar to a sheet in Excel, or a table in a SQL database.

Pandas has powerful methods for most things you'll want to do with this type of data.
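To make the table analogy concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The name `homes` and its values are made up for illustration and are not taken from the Melbourne data.

```python
import pandas as pd

# Hypothetical toy table: each dict key becomes a column, each list entry a row
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [650000, 903000, 1330000],
})

print(homes.shape)            # (number of rows, number of columns)
print(list(homes.columns))    # the column names
print(homes["Price"].max())   # largest value in a single column
```

Each column can be selected by name, much like picking a column in a spreadsheet.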

As an example, we'll look at [data about home prices](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot) in Melbourne, Australia.

In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.

The example (Melbourne) data is at the file path **`../input/melbourne-housing-snapshot/melb_data.csv`**. (In my environment the path is different: the code below uses `./input/melbourne-housing-snapshot/melb_data.csv`.)

We load and explore the data with the following commands:

In [4]:
# save filepath to variable for easier access
melbourne_file_path = './input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


In [5]:
melbourne_data.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | | | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | | | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |
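The `...` column in the output above means some columns were hidden to fit the display. One way to see every column name is to list them directly. The sketch below uses a made-up wide DataFrame called `wide` as a stand-in, so it runs without the Melbourne file.

```python
import pandas as pd

# Hypothetical wide table with 30 columns, used only for illustration
wide = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

# .columns holds every column name, even the ones a printed preview hides
all_columns = wide.columns.tolist()
print(all_columns)

# head(n) returns the first n rows as a new DataFrame with all columns intact
first_rows = wide.head(1)
print(first_rows.shape)
```

The same two calls should work on `melbourne_data` once it has been loaded.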


# Interpreting Data Description

The results show 8 numbers for each numeric column in your original dataset.

The first number, the **count**, shows how many rows have non-missing values.

Missing values arise for many reasons.

For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1-bedroom house.

We'll come back to the topic of missing data.
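As a small illustration of how **count** relates to missing values, the sketch below compares the total number of rows with the per-column count. The DataFrame `toy` and its values are invented for the example; only the column names echo the Melbourne data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the four BuildingArea entries are missing (NaN)
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4],
    "BuildingArea": [79.0, np.nan, 150.0, np.nan],
})

total_rows = len(toy)          # every row, missing or not
non_missing = toy.count()      # per-column number of non-missing values
missing = toy.isnull().sum()   # per-column number of missing values

print(total_rows)
print(non_missing)
print(missing)
```

A column whose count is lower than the number of rows, like `BuildingArea` and `YearBuilt` in the Melbourne summary, has missing values.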

The second value is the **mean**, which is the average.

Under that, **std** is the standard deviation, which measures how numerically spread out the values are.
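The following sketch computes both statistics on a short, made-up series so the numbers can be checked by hand. Note that pandas reports the sample standard deviation (dividing by n - 1) by default; passing `ddof=0` gives the population version instead.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()                 # sum of values divided by how many there are
std_sample = values.std()            # default ddof=1: divides by n - 1
std_population = values.std(ddof=0)  # divides by n

print(mean)
print(std_sample)
print(std_population)
```

The `std` row produced by `describe()` matches the default (sample) version.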

To interpret the **min**, **25%**, **50%**, **75%** and **max** values, imagine sorting each column from lowest to highest value.

The first (smallest) value is the min.

If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.

That is the **25%** value (pronounced "25th percentile").

The 50th and 75th percentiles are defined analogously, and the **max** is the largest number.
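The sorting picture can be tried out on a handful of made-up numbers. This is a minimal sketch: the series `values` is hypothetical, and with only five entries each percentile lands exactly on one of them. When a percentile falls between two entries, pandas interpolates between the neighbors by default.

```python
import pandas as pd

# Hypothetical unsorted values
values = pd.Series([50, 10, 40, 20, 30])

# Sorting makes the positions of the percentiles visible
sorted_values = values.sort_values().tolist()
print(sorted_values)

smallest = values.min()
q25 = values.quantile(0.25)   # a quarter of the way through the sorted list
q50 = values.quantile(0.50)   # the median
q75 = values.quantile(0.75)   # three quarters of the way through
largest = values.max()

print(smallest, q25, q50, q75, largest)
```

These are the same five quantities that `describe()` reports in its `min`, `25%`, `50%`, `75%` and `max` rows.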

# Your Turn
Get started with your **[first coding exercise](https://www.kaggle.com/kernels/fork/1258954)**

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161285) to chat with other Learners.*