# Intro to Pandas DataFrames
---
Pandas is built on top of NumPy.

Pandas is our go-to library for handling tabular data: it can be used to manipulate, clean, visualize, and analyze data.
It plays a role in Python similar to the one SQL plays for relational databases: selecting, filtering, and grouping rows of a table.

## Two main datatypes:

1. Series: a one-dimensional labeled array (a single column of values with an index)
2. DataFrame: a two-dimensional labeled table. Each column is a pandas Series, and all columns share the same index.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
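
To make the relationship between the two datatypes concrete, here is a minimal sketch. The fruit names and numbers are toy values chosen for illustration, and the variable names `prices`, `table`, and `column` are assumptions of this example.

```python
import pandas as pd

# A Series: one-dimensional values with an index (toy data)
prices = pd.Series([1.30, 2.50], index=['Apples', 'Oranges'], name='Price')

# A DataFrame: several columns that share one index
table = pd.DataFrame({'Sale Qty': [1, 5], 'Price': [1.30, 2.50]},
                     index=['Apples', 'Oranges'])

# Selecting a single column from a DataFrame gives back a Series
column = table['Price']
print(type(column))
print(column.equals(prices))
```

Selecting one column with `table['Price']` returns a Series, which is why the same methods (`max()`, `unique()`, and so on) work on both objects.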

---
### Topics Covered

1. Creating DataFrames
2. DataFrame functions: (head, shape, describe, info)
3. Unique Values of columns
4. Accessing column names and values: use .tolist() to store them as a list
5. Slicing a dataframe with a specific condition:

    df[df['Column Name'] condition]

6. Slicing a dataframe using iloc[rows, columns]
7. Viewing a dataframe with multiple conditions:

    df[(df['Column1'] condition1) & (df['Column2'] condition2)]

In [1]:
# 1st: Import the libraries

import pandas as pd
import numpy as np

## Constructing a DataFrame from a Dictionary
---

#### Pandas takes the keys & values stored in a dictionary & creates a table out of them

In [2]:
d = {'col1': [1, 2], 'col2': [3, 4]}

df = pd.DataFrame(data=d)

display(df)

Unnamed: 0,col1,col2
0,1,3
1,2,4


#### Creating a simple sales table out of a pandas dataframe:

 - Dataframes can be created from a dict
 - keys in the dict --> column names (across the top)
 - values in the dict --> the values in each column (down)
 - the optional `index` argument --> row labels

In [3]:
p = {'Sale Qty': [1,5,2,5], 'Price': [1.30, 2.50, 1.35, 2.25]}

df = pd.DataFrame(data = p, index = ['Apples', 'Oranges' , 'Mangos', 'Cherries'])

display(df)

Unnamed: 0,Sale Qty,Price
Apples,1,1.3
Oranges,5,2.5
Mangos,2,1.35
Cherries,5,2.25


#### Performing operations on a dataframe:

 - Add an additional column "Total" which calculates the gross sale of the product

In [4]:
df['Total'] = df['Sale Qty'] * df['Price']

display(df)

Unnamed: 0,Sale Qty,Price,Total
Apples,1,1.3,1.3
Oranges,5,2.5,12.5
Mangos,2,1.35,2.7
Cherries,5,2.25,11.25


#### Pandas inspecting dataframes:
 - Check shape of dataframe
 - Show summary statistics of each numeric column: `describe()` method
 - Display the index, column dtypes, non-null counts, and memory usage: `info()` method

In [5]:
print(f'shape of dataframe: {df.shape}\n')

display(df.describe())

df.info()

shape of dataframe: (4, 3)



Unnamed: 0,Sale Qty,Price,Total
count,4.0,4.0,4.0
mean,3.25,1.85,6.9375
std,2.061553,0.615088,5.75259
min,1.0,1.3,1.3
25%,1.75,1.3375,2.35
50%,3.5,1.8,6.975
75%,5.0,2.3125,11.5625
max,5.0,2.5,12.5


<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Apples to Cherries
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Sale Qty  4 non-null      int64  
 1   Price     4 non-null      float64
 2   Total     4 non-null      float64
dtypes: float64(2), int64(1)
memory usage: 128.0+ bytes


## Pandas: Working with Datasets
---

#### Accessing a data file

```python
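# Placeholder pattern: pick the reader that matches the format (read_csv, read_json, or read_sql)
df = pd.read_csv('path/to/file.csv')  # 'path/to/file.csv' is a placeholder path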
```

 - choose data format: `read_csv`, `read_json`, `read_sql`
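
As a small self-contained sketch of `read_csv`, the example below reads CSV text from memory instead of from disk. The CSV content and the variable names `csv_text` and `sample` are made up for illustration; in practice the argument would usually be a file path or a URL.

```python
import io
import pandas as pd

# Hypothetical CSV content; normally this would live in a file or behind a URL
csv_text = "Season,Wscore,Lscore\n1985,81,64\n1985,77,70\n"

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the CSV becomes the column names, and pandas assigns a default integer index starting at 0.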

#### Read a large dataframe from an external link:
 - the CSV file is loaded into a variable named `data`

In [6]:
data = pd.read_csv('https://raw.githubusercontent.com/rbhatia46/Numpy-Pandas-Beginner-Tutorial/master/RegularSeasonCompactResults.csv')

display(data)

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0
...,...,...,...,...,...,...,...,...
145284,2016,132,1114,70,1419,50,N,0
145285,2016,132,1163,72,1272,58,N,0
145286,2016,132,1246,82,1401,77,N,1
145287,2016,132,1277,66,1345,62,N,0


#### Create a new variable `df` that is a copy of the external `data`:

In [7]:
df = data.copy()

display(df.head(8))   # Note: no argument shows the 1st 5 lines. df.head(10) would show the 1st 10 lines, etc.

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0
5,1985,25,1218,79,1337,78,H,0
6,1985,25,1228,64,1226,44,N,0
7,1985,25,1242,58,1268,56,N,0


#### Show the final rows in the dataframe:
 - `.tail()` method mirrors `.head()`: it returns the last 5 rows by default

In [8]:
display(df.tail())

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
145284,2016,132,1114,70,1419,50,N,0
145285,2016,132,1163,72,1272,58,N,0
145286,2016,132,1246,82,1401,77,N,1
145287,2016,132,1277,66,1345,62,N,0
145288,2016,132,1386,87,1433,74,N,0


#### Analyze data using `.info()` method:

In [9]:
df.info()   # info() prints its report directly and returns None, so wrapping it in display() only adds a stray "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145289 entries, 0 to 145288
Data columns (total 8 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Season  145289 non-null  int64 
 1   Daynum  145289 non-null  int64 
 2   Wteam   145289 non-null  int64 
 3   Wscore  145289 non-null  int64 
 4   Lteam   145289 non-null  int64 
 5   Lscore  145289 non-null  int64 
 6   Wloc    145289 non-null  object
 7   Numot   145289 non-null  int64 
dtypes: int64(7), object(1)
memory usage: 8.9+ MB


#### Analyze data using `.describe()` method:

In [10]:
display(df.describe())

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Numot
count,145289.0,145289.0,145289.0,145289.0,145289.0,145289.0,145289.0
mean,2001.574834,75.223816,1286.720646,76.600321,1282.864064,64.497009,0.044387
std,9.233342,33.287418,104.570275,12.173033,104.829234,11.380625,0.247819
min,1985.0,0.0,1101.0,34.0,1101.0,20.0,0.0
25%,1994.0,47.0,1198.0,68.0,1191.0,57.0,0.0
50%,2002.0,78.0,1284.0,76.0,1280.0,64.0,0.0
75%,2010.0,103.0,1379.0,84.0,1375.0,72.0,0.0
max,2016.0,132.0,1464.0,186.0,1464.0,150.0,6.0


In [11]:
print(f'Shape of df: {df.shape}')

Shape of df: (145289, 8)


## Understanding the Data
---

#### Accessing the column names by themselves:

In [12]:
display(df.columns) # outputs the names of each column

Index(['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc',
       'Numot'],
      dtype='object')
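
If a plain Python list of column names is more convenient than an `Index` object, `.tolist()` can be applied to `df.columns` as well. A minimal sketch on a toy dataframe (the name `toy` and its values are assumptions of this example):

```python
import pandas as pd

# Toy dataframe with two of the columns from the dataset
toy = pd.DataFrame({'Season': [1985], 'Wscore': [81]})

# .columns is an Index object; .tolist() converts it to a regular list
column_names = toy.columns.tolist()
print(column_names)
```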

#### Display all values in column "Season":

In [13]:
display(df['Season'])

0         1985
1         1985
2         1985
3         1985
4         1985
          ... 
145284    2016
145285    2016
145286    2016
145287    2016
145288    2016
Name: Season, Length: 145289, dtype: int64

#### Display all unique values in the column "Season":
 - note that `.unique()` returns a NumPy array, not a list

In [14]:
display(df['Season'].unique())

# make unique values into a numpy array:
unique_vals = df['Season'].unique()

print(unique_vals)
print(type(unique_vals))

array([1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995,
       1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006,
       2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016])

[1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
 2013 2014 2015 2016]
<class 'numpy.ndarray'>


#### Save array values `df['Season'].unique()` as a list to print with formatting:

In [15]:
# save non-repeating array values to a list (method chaining: .unique(), then .tolist())
season_unique = df['Season'].unique().tolist()

# print the list of non-repeating unique values
print(f'\'season_unique\' Values: {season_unique}')

'season_unique' Values: [1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]


#### Check datatypes:

In [16]:
print(f'Type of \'season_unique\': {type(season_unique)}\n') # this is a list
print('Type of \'unique_vals\'', type(unique_vals)) # this is an ndarray

print(f'\nLength of \'season_unique\': {len(season_unique)}')

Type of 'season_unique': <class 'list'>

Type of 'unique_vals' <class 'numpy.ndarray'>

Length of 'season_unique': 32


#### Save the unique values of "Daynum" to a list (method chaining):

In [17]:
daynum_unique = df['Daynum'].unique().tolist()

# print the list
print(f'\'daynum_unique\' Values: {daynum_unique}')

# How many unique values are in this list (length of the list)?
print(f'\nLength of \'daynum_unique\': {len(daynum_unique)}') 

'daynum_unique' Values: [20, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 24, 64, 18, 19, 21, 23, 16, 17, 22, 11, 15, 12, 13, 14, 10, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7]

Length of 'daynum_unique': 133
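
Note that `.unique()` keeps values in order of first appearance, which is why the list above is not sorted. As a hedged sketch on toy data (the values and the names `toy`, `n_seasons`, and `sorted_seasons` are illustrative), `nunique()` counts the distinct values directly and `sorted()` puts them in ascending order:

```python
import pandas as pd

# Toy column whose values are deliberately out of order
toy = pd.DataFrame({'Season': [1986, 1985, 1986, 1987, 1985]})

# unique() preserves order of first appearance
first_seen = toy['Season'].unique().tolist()

# nunique() returns the number of distinct values without building a list
n_seasons = toy['Season'].nunique()

# sorted() gives the distinct values in ascending order
sorted_seasons = sorted(first_seen)

print(first_seen, n_seasons, sorted_seasons)
```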


#### Getting maximum and minimum values of every column:

In [18]:
print(f'All Maximum Values:\n\n{df.max()}\n')
print(f'All Minimum Values:\n\n{df.min()}\n')

All Maximum Values:

Season    2016
Daynum     132
Wteam     1464
Wscore     186
Lteam     1464
Lscore     150
Wloc         N
Numot        6
dtype: object

All Minimum Values:

Season    1985
Daynum       0
Wteam     1101
Wscore      34
Lteam     1101
Lscore      20
Wloc         A
Numot        0
dtype: object



#### Particular column max, min, mean, and sum values:

In [19]:
print('Maximum Value in Wscore Column:\n\n', df['Wscore'].max())
print(f'\nMinimum Value in Wscore Column:\n\n', df['Wscore'].min())
print(f'\nMean Value in Wscore Column:\n\n', df['Wscore'].mean())
print(f'\nSum of Wscore Column:\n\n', df['Wscore'].sum())

Maximum Value in Wscore Column:

 186

Minimum Value in Wscore Column:

 34

Mean Value in Wscore Column:

 76.60032074004226

Sum of Wscore Column:

 11129184
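
The four statistics above can also be requested in a single call with `.agg()`. This is a minimal sketch on toy scores, not the notebook's own approach; the names `toy` and `summary` are assumptions.

```python
import pandas as pd

# Toy scores for illustration
toy = pd.DataFrame({'Wscore': [81, 77, 63, 70]})

# agg() accepts a list of aggregation names and returns one value per name
summary = toy['Wscore'].agg(['max', 'min', 'mean', 'sum'])
print(summary)
```

The result is a Series indexed by the aggregation names, so a single statistic can be read with `summary['mean']`.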


#### How many times was a particular value repeated in a column?
 - Use the `.value_counts()` method.

In [20]:
print('Value:  # Repeats:\n')

display(df['Season'].value_counts())

Value:  # Repeats:



2016    5369
2014    5362
2015    5354
2013    5320
2010    5263
2012    5253
2009    5249
2011    5246
2008    5163
2007    5043
2006    4757
2005    4675
2003    4616
2004    4571
2002    4555
2000    4519
2001    4467
1999    4222
1998    4167
1997    4155
1992    4127
1991    4123
1996    4122
1995    4077
1994    4060
1990    4045
1989    4037
1993    3982
1988    3955
1987    3915
1986    3783
1985    3737
Name: Season, dtype: int64
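
By default `value_counts()` sorts by frequency, as in the output above. The sketch below, on toy data with assumed names, shows two common variations: sorting by the value itself and reporting shares instead of counts.

```python
import pandas as pd

# Toy seasons for illustration
toy = pd.Series([2016, 2014, 2016, 2015, 2016, 2014], name='Season')

counts = toy.value_counts()                 # sorted by count, descending
by_season = counts.sort_index()             # sorted by season instead
shares = toy.value_counts(normalize=True)   # fractions that sum to 1

print(by_season)
print(shares)
```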

#### Does the dataset have any NULL values?

 - Use the `isnull()` method

In [21]:
display(df.isnull())

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
145284,False,False,False,False,False,False,False,False
145285,False,False,False,False,False,False,False,False
145286,False,False,False,False,False,False,False,False
145287,False,False,False,False,False,False,False,False
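
A boolean table with this many rows is hard to scan by eye. One common way to condense it, sketched here on a toy dataframe that deliberately contains missing values (the data and variable names are assumptions), is to count the `True` entries per column or to reduce everything to a single yes/no answer:

```python
import numpy as np
import pandas as pd

# Toy dataframe with one missing value in each column
toy = pd.DataFrame({'Wscore': [81, np.nan, 63], 'Wloc': ['N', 'H', None]})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Single boolean: is anything missing anywhere?
has_any_missing = bool(toy.isnull().any().any())

print(missing_per_column)
print(has_any_missing)
```

Applied to the dataset above, `df.isnull().sum()` would be expected to show 0 for every column, since `df.info()` reports 145289 non-null entries in each.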


## Accessing Values in a DataFrame
---
One method: `iloc` (integer position-based indexing)
  
  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

---

#### Call the dataframe itself to look at it:

In [22]:
display(df.head())

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


#### What is the maximum value in the 'Wscore' column?
 - Use the `max()` method

In [23]:
print(df['Wscore'].max()) # The maximum value for 'Wscore' = 186

186


#### Question: What are the values of the dataframe where Wscore is max?
 
 - Desired output: the full row (the value of every column) where Wscore is at its maximum
 - The cell below is an example of an incorrect approach

In [24]:
# What happens? df['Wscore'].max() returns the VALUE 186, so iloc treats 186 as a row position
display(df.iloc[[df['Wscore'].max()]] )

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
186,1985,33,1155,79,1375,69,H,0


#### Proper output: look up the position of the maximum
 - Use the `argmax()` method, which returns the integer position of the maximum value

In [25]:
# displays the entire row of data by default
# df['Wscore'].argmax() returns the LOCATION of the maximum value Wscore 186

display(df.iloc[[df['Wscore'].argmax()]] )

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
24970,1991,68,1258,186,1109,140,H,0
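
A related method is `idxmax()`, which returns the index label of the maximum rather than its integer position. With the default 0-based index used in this dataset the two coincide, but they differ as soon as the index has other labels. A minimal sketch with toy data and assumed names:

```python
import pandas as pd

# Toy dataframe with string labels so that position and label differ
toy = pd.DataFrame({'Wscore': [70, 186, 81]}, index=['a', 'b', 'c'])

pos = toy['Wscore'].argmax()     # integer position of the maximum
label = toy['Wscore'].idxmax()   # index label of the maximum

row_by_pos = toy.iloc[[pos]]      # iloc pairs with positions
row_by_label = toy.loc[[label]]   # loc pairs with labels

print(pos, label)
print(row_by_label)
```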


#### To access only one column of data where the Wscore max of 186 is located:

In [26]:
display(df.iloc[[df['Wscore'].argmax()]]['Season'])

24970    1991
Name: Season, dtype: int64

#### Slicing a dataframe:
 - Python indexing starts from 0
 - In Pandas, we use the `iloc` indexer for position-based slicing
 - useful for working on a specific part of the dataframe
 - Syntax: `df.iloc[start row : end row, start column : end column]` (the end row and end column are not included)

In [27]:
df.iloc[10:14, 2:5] # Rows 10 up to / not including 14, columns 2 up to / not including 5

Unnamed: 0,Wteam,Wscore,Lteam
10,1307,103,1288
11,1344,75,1438
12,1374,91,1411
13,1412,70,1397


#### Save the sliced dataframe as a new dataframe:
 - Operations on the new dataframe should not affect the original; to guarantee this when modifying values, create the slice with `.copy()`

In [28]:
df_s = df.iloc[10:14, 2:5]

display(df_s)

Unnamed: 0,Wteam,Wscore,Lteam
10,1307,103,1288
11,1344,75,1438
12,1374,91,1411
13,1412,70,1397
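
Whether a slice shares data with the original can depend on the pandas version and settings, so an explicit `.copy()` is a safe way to get an independent dataframe before modifying it. The following is a sketch on toy data (values borrowed from the first two rows above; the names `toy` and `subset` are assumptions):

```python
import pandas as pd

# Toy dataframe for illustration
toy = pd.DataFrame({'Wteam': [1307, 1344],
                    'Wscore': [103, 75],
                    'Lteam': [1288, 1438]})

# Slice the first two columns and make the result independent of `toy`
subset = toy.iloc[0:2, 0:2].copy()

# Modify the copy only
subset['Wscore'] = subset['Wscore'] + 1

print(toy['Wscore'].tolist())     # original values are unchanged
print(subset['Wscore'].tolist())
```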


#### All rows with only columns 2 and 3

In [29]:
df_t = df.iloc[:, 2:4]    

display(df_t)

Unnamed: 0,Wteam,Wscore
0,1228,81
1,1106,77
2,1112,63
3,1165,70
4,1192,86
...,...,...
145284,1114,70
145285,1163,72
145286,1246,82
145287,1277,66


#### Filtering data with conditionals:

In [30]:
# start with original dataframe:

display(df.head())

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


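#### What if we want to look at the 'Wscore' values that are greater than the average value?
 - The average is about 76.6 and scores are whole numbers, so "greater than the average" is the same as `Wscore >= 77`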

In [31]:
print('Average Wscore:\n\n', df['Wscore'].mean())

print('\nShape of dataframe with values >= 77:\n\n', df[df['Wscore'] >= 77].shape)

Average Wscore:

 76.60032074004226

Shape of dataframe with values >= 77:

 (69283, 8)


#### Display all rows with $ \mathrm{Wscore} \geq 77$

In [32]:
display(df[df['Wscore'] >= 77])

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
4,1985,25,1192,86,1447,74,H,0
5,1985,25,1218,79,1337,78,H,0
8,1985,25,1260,98,1133,80,H,0
...,...,...,...,...,...,...,...,...
145278,2016,131,1386,82,1173,79,N,0
145279,2016,131,1392,80,1436,74,H,0
145281,2016,131,1419,82,1426,71,N,0
145286,2016,132,1246,82,1401,77,N,1
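
Instead of typing the threshold by hand, the mean can be used directly in the condition. A minimal sketch on toy scores (the data and the names `toy`, `avg`, and `above_avg` are illustrative):

```python
import pandas as pd

# Toy scores for illustration
toy = pd.DataFrame({'Wscore': [81, 77, 63, 70, 86]})

# Compute the average once, then compare every row against it
avg = toy['Wscore'].mean()
above_avg = toy[toy['Wscore'] > avg]

print(avg)
print(above_avg)
```

This keeps the filter correct even if the underlying data, and therefore the average, changes.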


#### How to combine multiple conditions on a DataFrame in one boolean expression?

 - Wrap each condition in parentheses and join them with `&` (and) or `|` (or)
 - Show all rows with an above-average Wscore from seasons after 2011:

In [33]:
display(df[(df['Season'] > 2011) & (df['Wscore'] >= 77)])

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
118636,2012,9,1385,78,1250,73,H,0
118637,2012,9,1401,81,1251,59,H,0
118638,2012,11,1102,87,1119,71,H,0
118640,2012,11,1113,78,1286,72,H,0
118641,2012,11,1116,83,1367,63,H,0
...,...,...,...,...,...,...,...,...
145278,2016,131,1386,82,1173,79,N,0
145279,2016,131,1392,80,1436,74,H,0
145281,2016,131,1419,82,1426,71,N,0
145286,2016,132,1246,82,1401,77,N,1
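
The same pattern extends to the other boolean operators: `|` for "or" and `~` for "not". The sketch below uses a toy dataframe with assumed names and values; the parentheses around each condition are required because `&`, `|`, and `~` bind more tightly than comparisons in Python.

```python
import pandas as pd

# Toy dataframe for illustration
toy = pd.DataFrame({
    'Season': [2010, 2012, 2014, 2016],
    'Wscore': [90, 70, 85, 60],
    'Wloc':   ['H', 'N', 'A', 'H'],
})

both = toy[(toy['Season'] > 2011) & (toy['Wscore'] >= 77)]     # AND: both must hold
either = toy[(toy['Season'] > 2011) | (toy['Wscore'] >= 77)]   # OR: at least one holds
not_home = toy[~(toy['Wloc'] == 'H')]                          # NOT: invert a condition

print(both)
print(either)
print(not_home)
```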


## Extracting Rows and Columns
---
Another method: `loc` (label-based indexing)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

---

#### Look at only 2 columns we are interested in across all indices:

In [34]:
display(df[['Season', 'Wscore']])

Unnamed: 0,Season,Wscore
0,1985,81
1,1985,77
2,1985,63
3,1985,70
4,1985,86
...,...,...
145284,2016,70
145285,2016,72
145286,2016,82
145287,2016,66


#### Same operation using the `.loc` indexer:

In [35]:
df.loc[:,['Season', 'Wscore']]

Unnamed: 0,Season,Wscore
0,1985,81
1,1985,77
2,1985,63
3,1985,70
4,1985,86
...,...,...
145284,2016,70
145285,2016,72
145286,2016,82
145287,2016,66
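
`loc` can also filter rows and pick columns in one step, and its label slices behave differently from `iloc` position slices. A minimal sketch on toy data (the names and values are assumptions of this example):

```python
import pandas as pd

# Toy dataframe with the default 0-based index
toy = pd.DataFrame({'Season': [1985, 1985, 1986],
                    'Wscore': [81, 77, 63],
                    'Lscore': [64, 70, 56]})

# Rows chosen by a condition, columns chosen by name
picked = toy.loc[toy['Wscore'] >= 77, ['Season', 'Wscore']]

# Label slices with loc include the end label ...
label_slice = toy.loc[0:1, 'Wscore']

# ... while position slices with iloc exclude the end position
pos_slice = toy.iloc[0:1]['Wscore']

print(picked)
print(len(label_slice), len(pos_slice))
```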


#### Also see the `groupby` method:
 - `.groupby(['Season']).head()` returns the first 5 rows of each season

In [36]:
display(df.groupby(['Season']).head())

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0
...,...,...,...,...,...,...,...,...
139920,2016,11,1104,77,1244,64,H,0
139921,2016,11,1105,68,1408,67,A,1
139922,2016,11,1112,79,1334,61,H,0
139923,2016,11,1115,58,1370,56,A,0
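
`groupby` is more often followed by an aggregation than by `head()`. As a hedged sketch on toy data (the values and the names `mean_by_season` and `games_by_season` are illustrative), this computes one summary value per season:

```python
import pandas as pd

# Toy dataframe for illustration
toy = pd.DataFrame({'Season': [1985, 1985, 1986, 1986, 1986],
                    'Wscore': [81, 77, 63, 70, 86]})

# Average winning score within each season
mean_by_season = toy.groupby('Season')['Wscore'].mean()

# Number of rows (games) within each season
games_by_season = toy.groupby('Season').size()

print(mean_by_season)
print(games_by_season)
```

Both results are Series indexed by `Season`, so `games_by_season` carries the same information as `value_counts()` on that column, only ordered by season.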
