# Pandas
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Pandas is used to `analyze data`.

##  Why Use Pandas?
Pandas allows us to `analyze big data` and `draw conclusions` based on statistical theories.

Pandas can **clean messy data sets**, and make them readable and relevant.

Relevant data is very important in data science.

> Note: Data Science is a branch of computer science in which we study how to store, use, and analyze data in order to derive information from it.

## What Can Pandas Do?
Pandas gives you answers about the data, such as:

* Is there a correlation between two or more columns?
* What is the average value?
* What is the maximum value?
* What is the minimum value?

Pandas can also **delete rows that are not relevant or that contain wrong values, such as empty or NULL values.** This is called `cleaning` the data.
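
The questions in the list above map onto a handful of DataFrame and Series methods. The following is a minimal sketch on made-up numbers; the `workouts` name and its values are assumptions for illustration, not data from this notebook.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
workouts = pd.DataFrame({
    "duration": [60, 60, 45, 45, 30],
    "calories": [409, 479, 282, 340, 195],
})

average_duration = workouts["duration"].mean()   # average value
longest = workouts["duration"].max()             # max value
shortest = workouts["duration"].min()            # min value
correlations = workouts.corr()                   # pairwise correlation between columns

print(average_duration, longest, shortest)
print(correlations)
```

`corr()` returns a square table with one row and one column per numeric column, so the correlation between `duration` and `calories` can be read from `correlations.loc["duration", "calories"]`.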

## Where is the Pandas Codebase?
https://github.com/pandas-dev/pandas

# Getting Started
## Install Pandas

In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.



## Import Pandas
Import Pandas with the `import` keyword, usually under the alias `pd`:

In [6]:
import pandas as pd
print(pd.__version__) # check version

2.2.2


In [5]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)
print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


# Pandas Series

## What is a Series? `.Series()`
A Pandas Series is like a `column in a table`

It is a `1D array` **holding data of any type**

### Example
Create a simple Pandas Series

In [7]:
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

0    1
1    7
2    2
dtype: int64


## Labels
If nothing else is specified, the **values are labeled with their index number.** The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

### Example
Return the first value of the Series:

In [8]:
print(myvar[0])

1


## Create Labels `.Series(index=)`
With the index argument, you can name your own labels.

### Example
Create your own labels:

In [10]:
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
print(myvar["y"])

x    1
y    7
z    2
dtype: int64
7


## Key/Value Objects as Series
You can also use a `key/value object`, like a dictionary, when creating a Series.

### Example
Create a simple Pandas Series from a dictionary:

In [11]:
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

day1    420
day2    380
day3    390
dtype: int64


> Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

### Example
Create a Series using only data from "day1" and "day2":

In [12]:
myvar2 = pd.Series(calories, index = ["day1", "day2"])
print(myvar2)

day1    420
day2    380
dtype: int64
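
One detail worth knowing about the index argument: a label that is not a key in the dictionary does not raise an error. A short sketch, where `"day4"` is a deliberately missing key chosen for illustration:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is not a key in the dictionary, so its value is filled with NaN.
partial = pd.Series(calories, index=["day1", "day4"])
print(partial)
```

Because NaN is a floating-point value, the resulting Series no longer has an integer dtype.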


## DataFrames `.DataFrame()`
Data sets in Pandas are usually **multi-dimensional tables**, called DataFrames.

A `Series` is like **a column**; a `DataFrame` is the **whole table**.

### Example
Create a DataFrame from a dictionary of two lists; each list becomes one column (a Series):

In [13]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)
print(myvar)

   calories  duration
0       420        50
1       380        40
2       390        45
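
To make the column/table relationship concrete, here is a small sketch that builds the same table from two actual `pd.Series` objects and then pulls one column back out. The variable names are chosen for illustration only.

```python
import pandas as pd

calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each Series becomes one column; rows are aligned on the shared index 0, 1, 2.
table = pd.DataFrame({"calories": calories, "duration": duration})
print(table)

# Selecting a single column returns a Series again.
one_column = table["calories"]
print(type(one_column).__name__)
```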


# Pandas DataFrames

## What is a DataFrame?
A Pandas DataFrame is a **2D Data Structure**, like a 2 dimensional array, or a table with rows and columns.

### Example
Create a simple Pandas DataFrame:

In [32]:

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load data into a DataFrame object:
df = pd.DataFrame(data)

print(df) 

   calories  duration
0       420        50
1       380        40
2       390        45


## Locate Row `.loc[]`
As you can see from the result above, the DataFrame is like a table with rows and columns.

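Pandas uses the `loc` attribute to `return one or more specified row(s)`.
> **Note: When you pass a list of labels inside `[]`, for example `df.loc[[0, 1]]`, the result is a Pandas DataFrame. A single label, as in `df.loc[0]`, returns a Pandas Series.**

In addition, `.head(N)` returns the first N rows (positions 0 to N-1).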

In [33]:
# Example
print(df.loc[0]) # return 0th row
print(" --- ")
print(df.loc[[0, 1]]) # return 0th and 1st row

calories    420
duration     50
Name: 0, dtype: int64
 --- 
   calories  duration
0       420        50
1       380        40
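
The difference between a single label and a list of labels can be checked directly. This sketch rebuilds the small calories/duration table under the assumed name `frame` so that it runs on its own:

```python
import pandas as pd

frame = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

row_as_series = frame.loc[0]     # single label -> Series
row_as_frame = frame.loc[[0]]    # list of labels -> DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)

# loc also accepts a row label and a column name to pick a single cell.
cell = frame.loc[1, "duration"]
print(cell)
```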


# Pandas - Analyzing DataFrames
## Viewing the Data
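One of the most used methods for getting a quick overview of the DataFrame is the `head()` method.

The head() method returns the headers and a specified number of rows, starting from the top. The tail() method does the same, starting from the bottom. When no number is given, both return 5 rows.

### Example
Get a quick overview by printing the first 5 and the last 5 rows of the DataFrame. The output below comes from a six-row DataFrame with `Duration`, `Pulse`, `Maxpulse`, and `Calories` columns, the same data used in the Dictionary as JSON section: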

In [29]:
print(df.head(5)) # return first 5 rows
print(df.head()) # also return first 5 rows
print(df.tail()) # return last 5 rows

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
   Duration  Pulse  Maxpulse  Calories
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
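
Both methods accept the number of rows as an argument. A minimal sketch on a hypothetical ten-row frame (the `frame` name and `value` column are assumptions for illustration):

```python
import pandas as pd

frame = pd.DataFrame({"value": range(10)})

first_three = frame.head(3)   # rows at positions 0, 1, 2
last_two = frame.tail(2)      # rows at positions 8, 9
default_head = frame.head()   # no argument: 5 rows

print(first_three)
print(last_two)
```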


## Named Indexes `.DataFrame(index = )`
With the index argument, you can **name your own indexes**.

### Example
Add a list of names to give each row a name:

In [17]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 
print(" --- ")
print(df.loc["day2"]) # return row with index day2

      calories  duration
day1       420        50
day2       380        40
day3       390        45
 --- 
calories    380
duration     40
Name: day2, dtype: int64
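
Named indexes work with lists and slices as well. The sketch below rebuilds the table under the assumed name `frame`. Note that a label slice in `loc` includes both endpoints, unlike ordinary Python slicing.

```python
import pandas as pd

frame = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

two_days = frame.loc[["day1", "day3"]]   # list of labels -> DataFrame
day_range = frame.loc["day1":"day2"]     # label slice includes both ends

print(two_days)
print(day_range)
```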


# Read CSV Files
For reference, CSV stands for comma-separated values.

CSV files contain plain text and are a well-known format that can be read by almost any tool, including Pandas.

> **Tip: use to_string() to print the entire DataFrame. Without it, printing a large DataFrame with many rows shows only the first 5 rows and the last 5 rows.**

## Export DataFrame to CSV `.to_csv`

In [18]:
# Export DataFrame to CSV (the row index is written as an unnamed first column)
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
df.to_csv('mydata.csv')

## Load Files Into a DataFrame `.read_csv`
If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [21]:
df = pd.read_csv('mydata.csv')
print(df)
# Similarly, for JSON files:
# df = pd.read_json('mydata.json')
print("---")
print(df.to_string()) # print entire DataFrame without truncation

   Unnamed: 0  calories  duration
0           0       420        50
1           1       380        40
2           2       390        45
---
   Unnamed: 0  calories  duration
0           0       420        50
1           1       380        40
2           2       390        45
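
The `Unnamed: 0` column in the output above is the row index that `to_csv` wrote to the file. One way to avoid it, sketched here in memory so that no file is touched, is to pass `index=False` when exporting. The `frame`, `csv_text`, and `restored` names are assumptions for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# With no path, to_csv() returns the CSV text instead of writing a file.
csv_text = frame.to_csv(index=False)   # do not write the row index
print(csv_text)

# StringIO lets read_csv treat the text like an open file.
restored = pd.read_csv(io.StringIO(csv_text))
print(list(restored.columns))
```

If the file was already written with the index, `pd.read_csv('mydata.csv', index_col=0)` reads that first column back as the index instead of as data.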


## max_rows `pd.options.display.max_rows` 
The **number of rows displayed** is defined in the Pandas option settings.

You can check the current maximum with the `pd.options.display.max_rows` statement.

### Example
Check the maximum number of displayed rows:

In [22]:
print(pd.options.display.max_rows) 

60


> **On my system the number is 60, which means that if the DataFrame contains more than 60 rows, the print(df) statement will show only the headers and the first and last 5 rows.**

You can change the maximum rows number with the same statement.

Increase the maximum number of rows to display the entire DataFrame:

In [23]:
pd.options.display.max_rows = 1000
print(pd.options.display.max_rows)

1000
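
Assigning to `pd.options.display.max_rows` changes the setting for the rest of the session. If the change should apply to only one block of code, `pd.option_context` is an alternative; a minimal sketch:

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# The option is changed only inside the with-block.
with pd.option_context("display.max_rows", 1000):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")   # restored to the previous value
print(before, inside, after)
```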


# Pandas Read JSON

Big data sets are often stored or extracted as JSON.

JSON is plain text in an object-like format. It is widely used in programming, and Pandas can read it directly.

> Tip: use to_string() to print the entire DataFrame.

In [24]:
# Save json
df.to_json('mydata.json')

In [25]:
# Read json
df = pd.read_json('mydata.json')
print(df.to_string())

   Unnamed: 0  calories  duration
0           0       420        50
1           1       380        40
2           2       390        45
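
The same round trip can be tried without creating a file. This is a sketch under the assumption that an in-memory string is acceptable for experimenting; the `frame`, `json_text`, and `restored` names are illustrative.

```python
import io
import pandas as pd

frame = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# With no path, to_json() returns the JSON text instead of writing a file.
json_text = frame.to_json()
print(json_text)

# Wrap the text in StringIO so read_json treats it as a file-like object.
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```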


## Dictionary as JSON
`JSON ≈ Python Dictionary`

JSON objects have nearly the same format as Python dictionaries.

If your `JSON data` is not in a file but already **in a Python Dictionary**, **you can load it into a DataFrame directly:**

### Example
Load a Python Dictionary into a DataFrame:

In [34]:
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

In [36]:
df = pd.DataFrame(data)
print(df) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
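
In a nested dictionary like this one, the outer keys become column names and the inner keys become row labels. Because the inner keys are strings, the row labels are strings too, even though they print like numbers. A shortened sketch, with the data trimmed to three rows for illustration:

```python
import pandas as pd

data = {
    "Duration": {"0": 60, "1": 60, "2": 45},
    "Pulse": {"0": 110, "1": 117, "2": 109},
}
frame = pd.DataFrame(data)

labels_before = list(frame.index)       # the strings "0", "1", "2"

# Convert the labels to integers so that frame.loc[0] works.
frame.index = frame.index.astype(int)
labels_after = list(frame.index)

print(labels_before, labels_after)
```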


## Info About the Data `info()`
The DataFrame object has a method called info() that gives you more information about the data set.

### Null Values
The info() method also tells us `how many Non-Null values` are present in each column.

Empty values, or Null values, can be bad when analyzing data, and you **should consider removing rows with empty values.** This is a step towards what is called **cleaning data**

In [37]:
print(df.info())  # info() prints the summary itself and returns None, hence the trailing "None"

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  6 non-null      int64
 1   Pulse     6 non-null      int64
 2   Maxpulse  6 non-null      int64
 3   Calories  6 non-null      int64
dtypes: int64(4)
memory usage: 240.0+ bytes
None
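
To act on the non-null counts, the missing values can be counted per column and the affected rows dropped. The sketch below uses a hypothetical three-row frame with one empty `Calories` value; the data in this notebook has no missing values.

```python
import pandas as pd

# Hypothetical data with one missing value (None becomes NaN).
frame = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Calories": [409.0, None, 282.0],
})

missing_per_column = frame.isna().sum()   # number of empty cells in each column
cleaned = frame.dropna()                  # new DataFrame without the incomplete rows

print(missing_per_column)
print(cleaned)
```

`dropna()` returns a new DataFrame and leaves `frame` unchanged, so the original data stays available for comparison.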
