# Pandas Tutorial

In [2]:
import pandas as pd

## Creating DataFrame with Demo Data

We can specify the column names by passing them as a list to the `columns` argument. Similarly, we can set the row labels with the `index` argument.

In [13]:
# df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Default integer column names
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])  # Adding the column names
# df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'], index=['x', 'y', 'z'])  # Adding the index
df

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9


## DataFrame Basics

### df.head()
- It will show us the first 5 rows of the dataframe by default
- We can specify the number of rows we want to see, e.g. `df.head(3)` or `df.head(2)`

In [4]:
df.head(2)

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6


### df.tail()
- It will show us the last 5 rows of the dataframe by default
- We can specify the number of rows we want to see, e.g. `df.tail(3)` or `df.tail(2)`

In [5]:
df.tail(2)

Unnamed: 0,A,B,C
1,4,5,6
2,7,8,9
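
The demo dataframe has only three rows, so the default of five rows never shows up in the outputs above. The sketch below uses a hypothetical 8-row frame (the name `demo` and its column `n` are made up for illustration) to make the defaults of `head()` and `tail()` visible.

```python
import pandas as pd

# Hypothetical 8-row frame, larger than the 3-row demo dataframe
demo = pd.DataFrame({"n": range(8)})

first_default = demo.head()   # no argument: first 5 rows
last_default = demo.tail()    # no argument: last 5 rows
first_two = demo.head(2)      # explicit row count

print(len(first_default), len(last_default), len(first_two))
print(last_default.index.to_list())  # index labels of the last 5 rows
```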


### df.columns
It will show us the names of the columns/headers. Remember that this is an attribute, not a method, so it is used without parentheses.

In [6]:
df.columns

Index(['A', 'B', 'C'], dtype='object')

### df.index
It will give us the index (the row labels) of the dataframe. For the default index this is a `RangeIndex` that shows the start, stop, and step.

In [9]:
print(df.index)

# To see the index labels as a list
print(df.index.to_list())

RangeIndex(start=0, stop=3, step=1)
[0, 1, 2]
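
When the dataframe is created with an explicit `index`, the same attribute returns those labels instead of a `RangeIndex`. A minimal sketch, assuming the `'x'`, `'y'`, `'z'` labels from the commented-out line in the first cell (the variable name `labelled` is made up here):

```python
import pandas as pd

# Same demo data, but with explicit row labels
labelled = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["A", "B", "C"],
    index=["x", "y", "z"],
)

print(labelled.index)             # an Index holding the labels
print(labelled.index.to_list())   # the labels as a plain Python list
```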


### df.info()
> Gives us an overview of the dataframe.
1. Index: the number of records and the range of the index.
2. Data columns: the number of columns.
3. Column-wise non-null counts as well as the data type of each column.
4. A summary of the data types in this dataframe.
5. The memory usage of the dataframe.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
 2   C       3 non-null      int64
dtypes: int64(3)
memory usage: 204.0 bytes
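
The demo dataframe has no missing values, so every column reports `3 non-null`. The sketch below uses a hypothetical frame with one missing value (the name `with_gap` is made up) to show how the non-null count and the dtype react.

```python
import pandas as pd

# Hypothetical frame: column A has one missing value
with_gap = pd.DataFrame({"A": [1, None, 7], "B": [2, 5, 8]})

with_gap.info()                 # prints the summary; A shows 2 non-null
non_null = with_gap.count()     # the same non-null counts as a Series
print(non_null)
print(with_gap.dtypes)          # A becomes float64 because NaN is a float
```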


### df.describe()
> Gives us some meaningful summary statistics for each numeric column of our dataframe.
1. **count**: Number of non-null records in each column.
2. **mean**: The mean of each column.
3. **std**: The standard deviation of each column.
4. **min**: The minimum value in the column.
5. **25%** (1st quartile): The value below which 25% of the data falls.
6. **50%** (Median): The middle value that splits the data into two halves.
7. **75%** (3rd quartile): The value below which 75% of the data falls.
8. **max**: The maximum value in the column.

In [12]:
df.describe()

Unnamed: 0,A,B,C
count,3.0,3.0,3.0
mean,4.0,5.0,6.0
std,3.0,3.0,3.0
min,1.0,2.0,3.0
25%,2.5,3.5,4.5
50%,4.0,5.0,6.0
75%,5.5,6.5,7.5
max,7.0,8.0,9.0
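
The numbers in this table can be reproduced by hand for column `A`. A small sketch, assuming pandas' defaults (sample standard deviation with `ddof=1`, and linear interpolation between neighbouring values for quantiles):

```python
import pandas as pd

# Column A of the demo dataframe
a = pd.Series([1, 4, 7], name="A")

mean_a = a.mean()            # (1 + 4 + 7) / 3
std_a = a.std()              # sample standard deviation (ddof=1)
q25 = a.quantile(0.25)       # halfway between 1 and 4
q75 = a.quantile(0.75)       # halfway between 4 and 7

print(mean_a, std_a, q25, q75)  # 4.0 3.0 2.5 5.5
```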


### df.nunique()
> To see the number of unique values in each column

In [14]:
df.nunique()

A    3
B    3
C    3
dtype: int64

### df['col'].unique()
> This will give us the list of unique values in a column

In [15]:
df['A'].unique()

array([1, 4, 7])
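
Every value in the demo dataframe is distinct, so `nunique()` simply equals the row count. The sketch below uses a hypothetical frame with repeated values (the `orders` name and its columns are made up) to show the difference between counting unique values and listing them.

```python
import pandas as pd

# Hypothetical frame with repeated values
orders = pd.DataFrame({
    "drink": ["Latte", "Espresso", "Latte", "Latte"],
    "size": ["S", "S", "M", "S"],
})

counts = orders.nunique()           # number of distinct values per column
drinks = orders["drink"].unique()   # distinct values, in order of first appearance

print(counts)
print(list(drinks))
```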

### df.shape
> Tells us the number of rows and columns in the dataframe.

In [16]:
df.shape

(3, 3)

### df.size
> Returns the total number of data points in the dataframe, i.e. the number of rows multiplied by the number of columns.

In [17]:
df.size

9
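
With a 3x3 frame it is hard to tell rows from columns in `shape`. A minimal sketch with a hypothetical 2x3 frame (the name `small` is made up) that also checks the relationship between `shape` and `size`:

```python
import pandas as pd

# Hypothetical frame with 2 rows and 3 columns
small = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "C"])

rows, cols = small.shape   # shape is a (rows, columns) tuple
total = small.size         # total number of cells

print(rows, cols, total)   # 2 3 6
```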

### df.sample()
> Returns one random row by default. We can specify how many rows we want to see, e.g. `df.sample(3)`.

In [26]:
df.sample(3)

Unnamed: 0,A,B,C
2,7,8,9
1,4,5,6
0,1,2,3
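
Because the rows are drawn at random, the output changes from run to run. If a repeatable result is needed, `sample` accepts a `random_state` seed. A small sketch (the seed value `0` and the variable names are arbitrary choices for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["A", "B", "C"])

# The same seed gives the same rows on every call
first_draw = df.sample(2, random_state=0)
second_draw = df.sample(2, random_state=0)

print(first_draw)
print(first_draw.equals(second_draw))  # True
```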


## Creating DataFrame from online Data

In [19]:
coffee = pd.read_csv('https://raw.githubusercontent.com/KeithGalli/complete-pandas-tutorial/refs/heads/master/warmup-data/coffee.csv')

coffee.head()

Unnamed: 0,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35


## Accessing the Data from a DataFrame


### df.loc[#rows, #columns]
> We can access specific row(s) and column(s) by label. With `loc` we pass index labels and column names, whereas with `iloc` we pass integer positions.

In [29]:
# Accessing the specific row by index
coffee.loc[3]

Day            Tuesday
Coffee Type      Latte
Units Sold          20
Name: 3, dtype: object

In [30]:
# Accessing multiple rows
coffee.loc[[1,6,8]]

Unnamed: 0,Day,Coffee Type,Units Sold
1,Monday,Latte,15
6,Thursday,Espresso,40
8,Friday,Espresso,45


In [31]:
# Perform slicing
coffee.loc[5:8]

Unnamed: 0,Day,Coffee Type,Units Sold
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30
8,Friday,Espresso,45


In [34]:
# Specific column
coffee.loc[5:8, ['Day']]

Unnamed: 0,Day
5,Wednesday
6,Thursday
7,Thursday
8,Friday


In [35]:
# Specific columns
coffee.loc[5:8, ['Day', 'Units Sold']]

Unnamed: 0,Day,Units Sold
5,Wednesday,25
6,Thursday,40
7,Thursday,30
8,Friday,45


In [38]:
# For all the rows and specific columns
coffee.loc[:, ['Day', 'Units Sold']]

Unnamed: 0,Day,Units Sold
0,Monday,25
1,Monday,15
2,Tuesday,30
3,Tuesday,20
4,Wednesday,35
5,Wednesday,25
6,Thursday,40
7,Thursday,30
8,Friday,45
9,Friday,35


### df.iloc[#rows, #column_index]
> This works like `loc`, but rows and columns are selected by integer position instead of by label. <br>
Remember that when slicing, `loc` includes the upper bound, whereas `iloc` does not.

In [39]:
# Accessing the specific row by index
coffee.iloc[3]

Day            Tuesday
Coffee Type      Latte
Units Sold          20
Name: 3, dtype: object

In [42]:
# Accessing multiple rows
coffee.iloc[[1,6,8]]

Unnamed: 0,Day,Coffee Type,Units Sold
1,Monday,Latte,15
6,Thursday,Espresso,40
8,Friday,Espresso,45


In [44]:
# Perform slicing
coffee.iloc[5:8]

# Note that the upper limit is not included

Unnamed: 0,Day,Coffee Type,Units Sold
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30


In [47]:
# Specific column
coffee.iloc[5:8, [0]] # Instead of the name of the column, we are passing the index

Unnamed: 0,Day
5,Wednesday
6,Thursday
7,Thursday


In [48]:
# Specific columns
coffee.iloc[5:8, [0,2]]

Unnamed: 0,Day,Units Sold
5,Wednesday,25
6,Thursday,40
7,Thursday,30


In [49]:
# For all the rows and specific columns
coffee.iloc[:, [0, 2]]

Unnamed: 0,Day,Units Sold
0,Monday,25
1,Monday,15
2,Tuesday,30
3,Tuesday,20
4,Wednesday,35
5,Wednesday,25
6,Thursday,40
7,Thursday,30
8,Friday,45
9,Friday,35
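
The difference between `loc` and `iloc` is easy to miss with a default index, because labels and positions coincide. A minimal sketch on a hypothetical 10-row frame (the names `frame` and `shifted` are made up) that shows both the slicing difference and what changes once the labels no longer match the positions:

```python
import pandas as pd

frame = pd.DataFrame({"value": range(10)})

by_label = frame.loc[5:8]       # labels 5, 6, 7, 8 (upper bound included)
by_position = frame.iloc[5:8]   # positions 5, 6, 7 (upper bound excluded)
print(len(by_label), len(by_position))  # 4 3

# Relabel the rows 100..109 so labels and positions differ
shifted = frame.copy()
shifted.index = range(100, 110)

print(shifted.iloc[3]["value"])   # the row at position 3 still exists
print(3 in shifted.index)         # False, so shifted.loc[3] would raise a KeyError
print(shifted.loc[103]["value"])  # the same row, reached by its label
```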


### Grab specific column
> We can grab all the information of a specific column

In [50]:
coffee.Day

0        Monday
1        Monday
2       Tuesday
3       Tuesday
4     Wednesday
5     Wednesday
6      Thursday
7      Thursday
8        Friday
9        Friday
10     Saturday
11     Saturday
12       Sunday
13       Sunday
Name: Day, dtype: object

In [52]:
# Another way, and my favorite:
print(coffee[['Day']])  # Double square brackets return a DataFrame.
print(coffee['Day'])  # Single square brackets return a Series.

          Day
0      Monday
1      Monday
2     Tuesday
3     Tuesday
4   Wednesday
5   Wednesday
6    Thursday
7    Thursday
8      Friday
9      Friday
10   Saturday
11   Saturday
12     Sunday
13     Sunday
0        Monday
1        Monday
2       Tuesday
3       Tuesday
4     Wednesday
5     Wednesday
6      Thursday
7      Thursday
8        Friday
9        Friday
10     Saturday
11     Saturday
12       Sunday
13       Sunday
Name: Day, dtype: object
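
The difference between the two bracket styles can be checked directly from the returned types. A small sketch on a hypothetical two-row frame (`coffee_demo` is a made-up stand-in for the coffee data); it also shows that a column name containing a space, such as `Units Sold`, is only reachable with brackets, not with attribute access:

```python
import pandas as pd

# Hypothetical stand-in for the coffee data
coffee_demo = pd.DataFrame({"Day": ["Monday", "Tuesday"], "Units Sold": [25, 30]})

as_series = coffee_demo["Day"]      # single brackets: Series
as_frame = coffee_demo[["Day"]]     # double brackets: DataFrame with one column
units = coffee_demo["Units Sold"]   # a name with a space needs bracket access

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
print(units.to_list())
```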


## Sorting Values in a Dataframe
> We can sort the values of a Dataframe

### df.sort_values('column_name')
> This will sort the dataframe by the given column in ascending order

In [54]:
coffee.sort_values('Units Sold')

Unnamed: 0,Day,Coffee Type,Units Sold
1,Monday,Latte,15
3,Tuesday,Latte,20
0,Monday,Espresso,25
5,Wednesday,Latte,25
2,Tuesday,Espresso,30
7,Thursday,Latte,30
4,Wednesday,Espresso,35
9,Friday,Latte,35
13,Sunday,Latte,35
11,Saturday,Latte,35


### ascending = False
> Sort in descending order

In [56]:
coffee.sort_values('Units Sold', ascending=False)

Unnamed: 0,Day,Coffee Type,Units Sold
10,Saturday,Espresso,45
8,Friday,Espresso,45
12,Sunday,Espresso,45
6,Thursday,Espresso,40
4,Wednesday,Espresso,35
11,Saturday,Latte,35
13,Sunday,Latte,35
9,Friday,Latte,35
2,Tuesday,Espresso,30
7,Thursday,Latte,30
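
Note that `sort_values` returns a new, sorted dataframe and leaves the original untouched unless the result is assigned back. A minimal sketch on a hypothetical three-row frame (the `sales` name and its values are made up):

```python
import pandas as pd

# Hypothetical frame, deliberately unsorted
sales = pd.DataFrame({"Day": ["Mon", "Tue", "Wed"], "Units Sold": [30, 15, 25]})

ascending_sales = sales.sort_values("Units Sold")                     # 15, 25, 30
descending_sales = sales.sort_values("Units Sold", ascending=False)   # 30, 25, 15

print(ascending_sales.index.to_list())    # original row labels travel with the rows
print(descending_sales.index.to_list())
print(sales.index.to_list())              # the original frame keeps its order
```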


### Sort Multiple Columns
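1. We need to pass the columns in a list; the dataframe is sorted by the first column, and ties are broken by the following ones.
2. The `ascending` argument is then also a list, with one entry per column. It normally holds `True`/`False`; `1` and `0` work as shorthand (0 = False, 1 = True).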

In [59]:
coffee.sort_values(['Day', 'Units Sold', 'Coffee Type'], ascending=[1,0,1])

Unnamed: 0,Day,Coffee Type,Units Sold
8,Friday,Espresso,45
9,Friday,Latte,35
0,Monday,Espresso,25
1,Monday,Latte,15
10,Saturday,Espresso,45
11,Saturday,Latte,35
12,Sunday,Espresso,45
13,Sunday,Latte,35
6,Thursday,Espresso,40
7,Thursday,Latte,30
