## Pandas
- A powerful library for handling tabular data.
- Designed for structured data from sources such as Excel and CSV files.
- Raw data is data collected for analysis that has not yet been cleaned or processed.
- Widely used for data preprocessing, which involves transforming data before analysis.
- Since Pandas has numerous features, simply learning the syntax is not enough. Practicing with a variety of raw datasets is essential for mastering its functionality.
## Data Types
- object : Holds arbitrary Python objects; typically Python's str type or mixed data types.
- int64 : Represents 64-bit integer values; corresponds to Python's int type (which, unlike int64, has no fixed width).
- float64 : Represents 64-bit floating-point numbers, equivalent to Python's float type.
- bool : Represents boolean values (True or False), equivalent to Python's bool type.
- datetime64 : Represents date and time data (displayed with a unit, e.g. datetime64[ns]).
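
To see these dtypes in practice, the sketch below builds one small Series per type and prints its `dtype`. The variable name `examples` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative values only; each Series is built to land on one of the dtypes listed above.
examples = {
    "object": pd.Series([1, "a", 2.5]),   # mixed Python objects
    "int64": pd.Series([1, 2, 3]),
    "float64": pd.Series([1.5, 2.0]),
    "bool": pd.Series([True, False]),
    "datetime64": pd.to_datetime(pd.Series(["2024-01-01", "2024-06-30"])),
}

for name, series in examples.items():
    print(name, "->", series.dtype)
```

The exact unit shown for the datetime dtype (for example `[ns]`) can depend on the pandas version.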

In [5]:
# Import the library
import pandas as pd

## Series
- Pandas provides two core data structures for handling data: Series and DataFrame.
- A Series represents one-dimensional data, while a DataFrame represents two-dimensional tabular data.
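
As a quick check of the one-dimensional versus two-dimensional distinction, the following sketch inspects `ndim` on both structures and shows that selecting a single DataFrame column yields a Series. The `scores` data is a hypothetical example.

```python
import pandas as pd

# Hypothetical sample data for illustration.
scores = pd.DataFrame({"math": [60, 75], "science": [90, 85]})
math_column = scores["math"]   # selecting one column yields a Series

print(scores.ndim)        # number of dimensions of the DataFrame
print(math_column.ndim)   # number of dimensions of the Series
print(type(math_column))
```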

In [6]:
# Series with the default integer index
mySeries1 = pd.Series([70, 60, 90])
print(type(mySeries1))  # Check the type of mySeries1
mySeries1  # Display the Series

<class 'pandas.core.series.Series'>


0    70
1    60
2    90
dtype: int64

In [7]:
# Series with a custom index (labels)
mySeries2 = pd.Series([70, 60, 90], index=["language", "math", "science"])
print(type(mySeries2))  # Check the type of mySeries2
mySeries2  # Display the Series

<class 'pandas.core.series.Series'>


language    70
math        60
science     90
dtype: int64

## Series Data CRUD

In [8]:
# Series with a custom index (labels)
mySeries2 = pd.Series([70, 60, 90], index=["language", "math", "science"])
print(type(mySeries2))  # Check the type of mySeries2
mySeries2  # Display the Series


<class 'pandas.core.series.Series'>


language    70
math        60
science     90
dtype: int64

In [9]:
mySeries2.index

Index(['language', 'math', 'science'], dtype='object')

In [10]:
mySeries2.index = ["science", "mechanic", "biology"]

In [11]:
mySeries2

science     70
mechanic    60
biology     90
dtype: int64

In [12]:
# Retrieve values
mySeries2.values

array([70, 60, 90])

In [14]:
# Retrieve the mechanic score by label
print(mySeries2["mechanic"])

# Retrieve the same score by integer position with iloc
print(mySeries2.iloc[1])

60
60


In [15]:
# Modify a specific value by label
mySeries2["mechanic"] = 100
mySeries2

science      70
mechanic    100
biology      90
dtype: int64

In [16]:
# Modify a specific value by integer position with iloc
mySeries2.iloc[1] = 20
mySeries2

science     70
mechanic    20
biology     90
dtype: int64

In [17]:
# Delete an entry by label (modifies the Series in place)
del mySeries2["mechanic"]
mySeries2

science    70
biology    90
dtype: int64
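
`del` modifies the Series in place. When the original should stay intact, `Series.drop` is an alternative that returns a new Series. A minimal sketch, using a hypothetical `scores` Series that mirrors the one above:

```python
import pandas as pd

scores = pd.Series([70, 60, 90], index=["science", "mechanic", "biology"])

# drop() returns a new Series; the original keeps all three entries.
without_mechanic = scores.drop("mechanic")

print(without_mechanic)
print(scores)
```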

## Change data type: Series.astype(new type)

In [18]:
ex_series = pd.Series([10,20,30,40,50])
print(type(ex_series))
ex_series

<class 'pandas.core.series.Series'>


0    10
1    20
2    30
3    40
4    50
dtype: int64

In [19]:
ex_series = ex_series.astype("float")
print(type(ex_series))
ex_series

<class 'pandas.core.series.Series'>


0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float64

In [20]:
ex_series = ex_series.astype("object")
print(type(ex_series))
ex_series

<class 'pandas.core.series.Series'>


0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: object
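
`astype` raises an error when a value cannot be converted, which is common with raw data. One possible way to handle that case is `pd.to_numeric` with `errors="coerce"`; the sketch below assumes a hypothetical text column with one unparseable entry.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text (object dtype), with one unparseable entry.
raw = pd.Series(["10", "20", "n/a", "40"], dtype="object")

# errors="coerce" replaces values that cannot be parsed with NaN,
# so the result is float64 rather than int64.
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned)
print(cleaned.dtype)
```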

## Understanding DataFrame

In [21]:
# A DataFrame can be created from a dictionary: each key becomes a column name,
# and each value (a list) holds that column's values
# DataFrame with the default integer index
df1 = pd.DataFrame({
    "USA" : [100,200,300,400],
    "South Korea" : [100,200,300,400],
    "Italy" : [100,200,300,400]
})
df1


|  | USA | South Korea | Italy |
|---|---|---|---|
| 0 | 100 | 100 | 100 |
| 1 | 200 | 200 | 200 |
| 2 | 300 | 300 | 300 |
| 3 | 400 | 400 | 400 |


In [22]:
# DataFrame with a custom index
df2 = pd.DataFrame({
    "USA" : [100,200,300,400],
    "South Korea" : [100,200,300,400],
    "Italy" : [100,200,300,400]},
    index=["A","B","C","D"]  # The index length must match the number of rows
)
df2


|  | USA | South Korea | Italy |
|---|---|---|---|
| A | 100 | 100 | 100 |
| B | 200 | 200 | 200 |
| C | 300 | 300 | 300 |
| D | 400 | 400 | 400 |


In [23]:
# Check the index
df2.index

Index(['A', 'B', 'C', 'D'], dtype='object')

In [24]:
# Modify the index
df2.index = ["A1000", "B2000", "C3000", "D4000"]
df2

|  | USA | South Korea | Italy |
|---|---|---|---|
| A1000 | 100 | 100 | 100 |
| B2000 | 200 | 200 | 200 |
| C3000 | 300 | 300 | 300 |
| D4000 | 400 | 400 | 400 |


In [25]:
# Check the columns
df2.columns

Index(['USA', 'South Korea', 'Italy'], dtype='object')

In [26]:
# Modify the columns
df2.columns = [10, 20, 30]
df2

|  | 10 | 20 | 30 |
|---|---|---|---|
| A1000 | 100 | 100 | 100 |
| B2000 | 200 | 200 | 200 |
| C3000 | 300 | 300 | 300 |
| D4000 | 400 | 400 | 400 |


In [27]:
# Retrieve the data
df2.values

array([[100, 100, 100],
       [200, 200, 200],
       [300, 300, 300],
       [400, 400, 400]])
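
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for getting the underlying array. A small sketch with a hypothetical two-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"USA": [100, 200], "Italy": [300, 400]}, index=["A", "B"])

# to_numpy() returns the data as a NumPy array, without index or column labels.
arr = df.to_numpy()

print(type(arr))
print(arr.shape)
```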

In [28]:
new_df = pd.DataFrame({
    "Year" : [2000,2001,2002],
    "USA" : [2.1, 2.2, 2.3],
    "South Korea" : [1.1, 1.2, 1.3],
    "China" : [0.1, 0.2, 0.3]})

new_df

|  | Year | USA | South Korea | China |
|---|---|---|---|---|
| 0 | 2000 | 2.1 | 1.1 | 0.1 |
| 1 | 2001 | 2.2 | 1.2 | 0.2 |
| 2 | 2002 | 2.3 | 1.3 | 0.3 |


In [29]:
# Select a specific column as the index
new_df = new_df.set_index("Year")
new_df

| Year | USA | South Korea | China |
|---|---|---|---|
| 2000 | 2.1 | 1.1 | 0.1 |
| 2001 | 2.2 | 1.2 | 0.2 |
| 2002 | 2.3 | 1.3 | 0.3 |


In [30]:
# Move the "Year" index back to a regular column
new_df = new_df.reset_index("Year")
new_df

|  | Year | USA | South Korea | China |
|---|---|---|---|---|
| 0 | 2000 | 2.1 | 1.1 | 0.1 |
| 1 | 2001 | 2.2 | 1.2 | 0.2 |
| 2 | 2002 | 2.3 | 1.3 | 0.3 |
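
Both `set_index` and `reset_index` return new DataFrames, which is why the cells above reassign `new_df`. The sketch below, built on a hypothetical two-column DataFrame, also shows the `drop=True` option, which discards the index values instead of restoring them as a column.

```python
import pandas as pd

df = pd.DataFrame({"Year": [2000, 2001, 2002], "USA": [2.1, 2.2, 2.3]})

indexed = df.set_index("Year")               # "Year" becomes the index; df itself is unchanged
restored = indexed.reset_index()             # "Year" returns as a regular column
discarded = indexed.reset_index(drop=True)   # the "Year" values are dropped instead

print(list(restored.columns))
print(list(discarded.columns))
```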


## Accessing data in a DataFrame
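- loc : Access rows and columns by label (index label and column name). Label slices include the end label.
- iloc : Access rows and columns by integer position (starts from 0). Position slices exclude the end position.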
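
The difference between the two accessors is easiest to see with slices. The following sketch uses a small hypothetical DataFrame to compare a label slice, a position slice, and a single-cell lookup.

```python
import pandas as pd

df = pd.DataFrame(
    {"USA": [5, 6, 7, 8], "China": [2, 3, 4, 5]},
    index=["A", "B", "C", "D"],
)

by_label = df.loc["A":"C"]     # label slice: the end label "C" is included
by_position = df.iloc[0:2]     # position slice: the end position 2 is excluded
cell = df.loc["B", "China"]    # a single cell by row label and column name

print(len(by_label), len(by_position), cell)
```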

In [31]:
myDf = pd.DataFrame({
    "USA" : [5,6,7,8,9,10],
    "South Korea" : [4,5,6,7,8,9],
    "India" : [3,4,5,6,7,8],
    "China" : [2,3,4,5,6,7]},
    index=["A","B","C","D","E","F"])

myDf

|  | USA | South Korea | India | China |
|---|---|---|---|---|
| A | 5 | 4 | 3 | 2 |
| B | 6 | 5 | 4 | 3 |
| C | 7 | 6 | 5 | 4 |
| D | 8 | 7 | 6 | 5 |
| E | 9 | 8 | 7 | 6 |
| F | 10 | 9 | 8 | 7 |


In [32]:
# Retrieve a specific row by label with loc
print(type(myDf.loc["C"]))
myDf.loc["C"]

<class 'pandas.core.series.Series'>


USA            7
South Korea    6
India          5
China          4
Name: C, dtype: int64

In [33]:
# Retrieve a specific row by integer position with iloc
print(type(myDf.iloc[2]))
myDf.iloc[2]

<class 'pandas.core.series.Series'>


USA            7
South Korea    6
India          5
China          4
Name: C, dtype: int64

In [34]:
# Retrieve a specific column by name
print(type(myDf["USA"]))
myDf["USA"]

<class 'pandas.core.series.Series'>


A     5
B     6
C     7
D     8
E     9
F    10
Name: USA, dtype: int64

In [35]:
# Retrieve a specific column by integer position with iloc
print(type(myDf.iloc[:, 0]))
myDf.iloc[:, 0]

<class 'pandas.core.series.Series'>


A     5
B     6
C     7
D     8
E     9
F    10
Name: USA, dtype: int64

In [36]:
# Retrieve data values
myDf.values

array([[ 5,  4,  3,  2],
       [ 6,  5,  4,  3],
       [ 7,  6,  5,  4],
       [ 8,  7,  6,  5],
       [ 9,  8,  7,  6],
       [10,  9,  8,  7]])

In [37]:
# Add a new column
myDf["France"] = [1,2,3,4,5,6]

myDf

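|  | USA | South Korea | India | China | France |
|---|---|---|---|---|---|
| A | 5 | 4 | 3 | 2 | 1 |
| B | 6 | 5 | 4 | 3 | 2 |
| C | 7 | 6 | 5 | 4 | 3 |
| D | 8 | 7 | 6 | 5 | 4 |
| E | 9 | 8 | 7 | 6 | 5 |
| F | 10 | 9 | 8 | 7 | 6 |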


In [38]:
# Delete a column
del myDf["China"]
myDf

|  | USA | South Korea | India | France |
|---|---|---|---|---|
| A | 5 | 4 | 3 | 1 |
| B | 6 | 5 | 4 | 2 |
| C | 7 | 6 | 5 | 3 |
| D | 8 | 7 | 6 | 4 |
| E | 9 | 8 | 7 | 5 |
| F | 10 | 9 | 8 | 6 |


In [39]:
# Delete a row by its index label (drop returns a new DataFrame)
myDf = myDf.drop(["F"])
myDf

|  | USA | South Korea | India | France |
|---|---|---|---|---|
| A | 5 | 4 | 3 | 1 |
| B | 6 | 5 | 4 | 2 |
| C | 7 | 6 | 5 | 3 |
| D | 8 | 7 | 6 | 4 |
| E | 9 | 8 | 7 | 5 |
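
`drop` can remove columns as well as rows, and it leaves the original DataFrame untouched unless the result is reassigned. A minimal sketch using the `index=` and `columns=` keywords on a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"USA": [5, 6, 7], "China": [2, 3, 4]},
    index=["A", "B", "C"],
)

fewer_columns = df.drop(columns=["China"])   # new DataFrame without the "China" column
fewer_rows = df.drop(index=["C"])            # new DataFrame without row "C"

print(fewer_columns.columns.tolist())
print(fewer_rows.index.tolist())
print(df.shape)   # the original still has all rows and columns
```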


In [40]:
# Select and copy DataFrame columns
# - copy() returns an independent copy of the data
# - Useful for keeping the original data intact while processing the copy,
#   for example when only specific columns are needed for analysis
myDf_copy1 = myDf.copy()
myDf_copy1

|  | USA | South Korea | India | France |
|---|---|---|---|---|
| A | 5 | 4 | 3 | 1 |
| B | 6 | 5 | 4 | 2 |
| C | 7 | 6 | 5 | 3 |
| D | 8 | 7 | 6 | 4 |
| E | 9 | 8 | 7 | 5 |


In [41]:
myDf_copy2 = myDf[["USA","South Korea"]].copy()
myDf_copy2

|  | USA | South Korea |
|---|---|---|
| A | 5 | 4 |
| B | 6 | 5 |
| C | 7 | 6 |
| D | 8 | 7 |
| E | 9 | 8 |


In [42]:
# Rename the columns of the copy; the original myDf is unchanged
myDf_copy2.columns = ["South Korea", "Japan"]
myDf_copy2

|  | South Korea | Japan |
|---|---|---|
| A | 5 | 4 |
| B | 6 | 5 |
| C | 7 | 6 |
| D | 8 | 7 |
| E | 9 | 8 |
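
To confirm that a copy is independent of its source, the sketch below modifies a copied column and checks the original. It also uses `rename`, which changes selected column names by mapping old names to new ones rather than replacing the whole `columns` list. The names `original`, `subset`, and `renamed` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"USA": [5, 6], "South Korea": [4, 5]}, index=["A", "B"])

subset = original[["USA"]].copy()    # independent copy of one column
subset.loc["A", "USA"] = 999         # modify the copy only
renamed = subset.rename(columns={"USA": "United States"})   # returns a new DataFrame

print(original.loc["A", "USA"])      # value in the original DataFrame
print(renamed.columns.tolist())
```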
