# Intro To Pandas Dataframe
---
Pandas is built on top of NumPy.

Pandas is our go-to library for handling data: it can be used to manipulate, clean, visualize, and analyze data.
It plays a role for in-memory tables similar to the one SQL plays for relational databases.

## Two main datatypes:

1. Series: a one-dimensional labeled array, comparable to a single column of a table
2. DataFrame: a two-dimensional table whose columns are pandas Series that share one index, so most Series functionality is available on a DataFrame as well.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
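
To make the relationship between the two datatypes concrete, here is a minimal sketch. The fruit labels and numbers are made-up illustration values, not part of any dataset used later.

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label
prices = pd.Series([1.30, 2.50], index=['Apples', 'Oranges'], name='Price')

# A DataFrame: several columns that share the same index
table = pd.DataFrame({'Sale Qty': [1, 5], 'Price': [1.30, 2.50]},
                     index=['Apples', 'Oranges'])

# Selecting a single column of a DataFrame gives back a Series
price_column = table['Price']

print(type(price_column).__name__)   # Series
print(table.ndim, prices.ndim)       # 2 1
```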

---
---
### Topics Covered

1. Creating DataFrames
2. DataFrame functions: (head, shape, describe, info)
3. Unique Values of columns
4. Accessing names of a column: use .tolist() to store as a list
5. Slicing a dataframe with a specific condition: 
  
    df[df['Column Name'] condition]

6. Slicing a dataframe using iloc[rows, columns]
7. Viewing dataframe with multiple conditions: 

    df[(df['Column1'] condition1) & (df['Column2'] condition2)]

In [None]:
# 1st: Import the libraries

import pandas as pd
import numpy as np

## Constructing a DataFrame from a Dictionary
---

In [None]:
# pandas takes keys and values stored in a dictionary and creates a table out of them

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df

In [None]:
# creating a simple sales table out of a pandas dataframe

'''
Dataframes are created from a dict
key in dict --> column name (across the top)
values in dict --> values of that column (down)
index argument --> row labels
'''

p = {'Sale Qty': [1,5,2,5], 'Price': [1.30, 2.50, 1.35, 2.25]}
df = pd.DataFrame(data = p, index = ['Apples', 'Oranges' , 'Mangos', 'Cherries'])
df
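
A quick way to check how the dict and the index argument were mapped is to read the column names and row labels back. This is a small sketch on a two-row version of the sales table; the name `sales` is only used here for illustration.

```python
import pandas as pd

# Two-row version of the sales table, for illustration only
sales = pd.DataFrame({'Sale Qty': [1, 5], 'Price': [1.30, 2.50]},
                     index=['Apples', 'Oranges'])

print(sales.columns.tolist())          # column names come from the dict keys
print(sales.index.tolist())            # row labels come from the index argument
print(sales.loc['Oranges', 'Price'])   # one cell, looked up by row label and column name
```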


In [None]:
# Performing operations on a dataframe
# Add an additional column "Total" which calculates the gross sale of the product

df['Total'] = df['Sale Qty'] * df['Price']
df

In [None]:
# Pandas inspecting dataframes

print(f'shape of dataframe: {df.shape}')

In [None]:
# summary of the dataframe: index range, column names, non-null counts, datatypes and memory usage

df.info()

In [None]:
# shows summary statistics (count, mean, std, min, quartiles, max) of each numeric column

df.describe()
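
The result of `.describe()` is itself a DataFrame, so individual statistics can be looked up by name. A minimal sketch on the same sales numbers (the variable names `sales` and `summary` are assumptions for this example):

```python
import pandas as pd

sales = pd.DataFrame({'Sale Qty': [1, 5, 2, 5],
                      'Price': [1.30, 2.50, 1.35, 2.25]})

summary = sales.describe()               # one row per statistic, one column per numeric column

print(summary.index.tolist())            # names of the statistics
print(summary.loc['mean', 'Sale Qty'])   # (1 + 5 + 2 + 5) / 4 = 3.25
print(summary.loc['max', 'Price'])       # 2.5
```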

## Pandas: Working with Datasets
---

In [None]:
# Accessing a data file
# pandas has a reader function for each common format, e.g. pd.read_csv, pd.read_json, pd.read_sql
# read a large dataset from an external link:
# the CSV file is loaded into a dataframe named "data"

data = pd.read_csv('https://raw.githubusercontent.com/rbhatia46/Numpy-Pandas-Beginner-Tutorial/master/RegularSeasonCompactResults.csv')

# display external data:

data
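
`pd.read_csv` also accepts a file-like object, which makes it easy to try out without a network connection. The sketch below uses a few made-up rows with the same column names as three of the columns in this dataset; the values are not real results.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file or URL
csv_text = """Season,Daynum,Wscore
1985,20,81
1985,25,77
1986,20,63
"""

games = pd.read_csv(io.StringIO(csv_text))   # the first line becomes the column names

print(games.shape)              # (3, 3)
print(games.columns.tolist())   # ['Season', 'Daynum', 'Wscore']
```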

In [None]:
# create a new variable "df" that is a copy of the external data:

df = data.copy()

# Get a view of the dataframe

df.head(8)   # Note: no argument shows the 1st 5 lines. df.head(10) would show the 1st 10 lines, etc.

In [None]:
df.tail()   # Opposite of .head() - Shows the final 5 rows

In [None]:
# Analyzing data using .info()

df.info()

In [None]:
# Analyzing data using .describe()

df.describe()

In [None]:
# look at shape of dataframe

print(f'Shape of df: {df.shape}')

# Each Column is referred to as a "feature" and each row an "instance" of that feature
# This dataframe has 145000+ instances and 8 features

## Understanding the Data
---

In [None]:
# Accessing the columns by themselves:

df.columns    # outputs the names of the columns

In [None]:
# Display all values in column "Season"

df['Season']

In [None]:
# display all unique values in the column "Season"
# note that this is an array

df['Season'].unique()

In [None]:
# another method: save the array values from df['Season'].unique() as a list to print with formatting:

season_unique = df['Season'].unique().tolist()    # save non-repeating array values to a list (method of a method)

print(f'\'season_unique\' Values: {season_unique}')   # print the list of non-repeating unique values

In [None]:
# Check datatypes of each:

print(f'Type of \'season_unique\': {type(season_unique)}\n')    # this is a list
print(f'Type of df[\'Season\'].unique():', type(df['Season'].unique()))   # this is an ndarray

# How many unique values are in this list (length of the list)?

print(f'\nLength of \'season_unique\': {len(season_unique)}')

In [None]:
daynum_unique = df['Daynum'].unique().tolist()    # save non-repeating array values to a list (method of a method)

print(f'\'daynum_unique\' Values: {daynum_unique}')   # print the list

# How many unique values are in this list (length of the list)?

print(f'\nLength of \'daynum_unique\': {len(daynum_unique)}') 
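
If only the number of unique values is needed, `.nunique()` gives it directly without building the list first. A small sketch on a made-up Series:

```python
import pandas as pd

# Made-up season values, for illustration only
seasons = pd.Series([1985, 1985, 1986, 1987, 1987, 1987], name='Season')

unique_values = seasons.unique().tolist()   # values in order of first appearance

print(unique_values)        # [1985, 1986, 1987]
print(len(unique_values))   # 3
print(seasons.nunique())    # 3, same count without the intermediate list
```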

In [None]:
# Getting maximum and minimum values of every column:

print(f'All Maximum Values:\n\n{df.max()}\n')
print(f'All Minimum Values:\n\n{df.min()}\n')

In [None]:
# Particular column max, min, mean, and sum values:

print('Maximum Value in Wscore Column:\n\n', df['Wscore'].max())
print('\nMinimum Value in Wscore Column:\n\n', df['Wscore'].min())
print('\nMean Value in Wscore Column:\n\n', df['Wscore'].mean())
print('\nSum of Wscore Column:\n\n', df['Wscore'].sum())

In [None]:
# How many times was a particular value repeated in a column?
# use the .value_counts() method.

print('Value:  # Repeats:\n')

df['Season'].value_counts()
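
By default `.value_counts()` orders the result by count, largest first. The sketch below, on made-up values, shows two common variations: ordering by the value itself and returning fractions instead of counts.

```python
import pandas as pd

# Made-up season values, for illustration only
seasons = pd.Series([1985, 1985, 1986, 1987, 1987, 1987], name='Season')

counts = seasons.value_counts()                  # most frequent value first
by_season = counts.sort_index()                  # ordered by season instead of by count
shares = seasons.value_counts(normalize=True)    # fractions that add up to 1

print(counts.index[0], counts.iloc[0])   # 1987 3
print(by_season.tolist())                # [2, 1, 3]
print(shares[1987])                      # 0.5
```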

In [None]:
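# does the dataset have any NULL values?
# df.isnull() returns a True/False table of the same shape; .sum() counts the True values per column

df.isnull().sum()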
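
To see what a missing value looks like in these checks, here is a sketch on a tiny made-up table with one gap. The column name `Lscore` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up scores with one missing value
scores = pd.DataFrame({'Wscore': [81, np.nan, 63],
                       'Lscore': [64, 70, 56]})

missing_per_column = scores.isnull().sum()   # number of missing values in each column
has_missing = scores.isnull().any().any()    # True if any value anywhere is missing

print(missing_per_column)
print(has_missing)   # True
```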

## Accessing Values in a DataFrame
---
One method: iloc
  
  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

---

In [None]:
# Call the dataframe itself to look at it:

df.head()

In [None]:
# What is the maximum value in the 'Wscore' column?

df['Wscore'].max()  # The maximum value for 'Wscore' = 186

In [None]:
# Question: What are the values of the dataframe where Wscore is max?
# Desired output: Values of each feature where the Wscore has maximum value

df.iloc[[df['Wscore'].max()]]   # What happened? df['Wscore'].max() returns the VALUE 186, so iloc returns the row at position 186

In [None]:
# Correct way to do it: use argmax(), which returns the position of the maximum

# displays the entire row of data by default
df.iloc[[df['Wscore'].argmax()]]   # df['Wscore'].argmax() returns the LOCATION of the maximum value Wscore 186

In [None]:
# How to access only one column of data where the Wscore max of 186 is located:

df.iloc[[df['Wscore'].argmax()]]['Season']
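
`argmax()` returns an integer position, which is why it pairs with `iloc`. Its label-based counterpart is `idxmax()`, which pairs with `loc`. With the default 0, 1, 2, ... index the two return the same number, so the sketch below uses made-up letter labels to make the difference visible.

```python
import pandas as pd

# Made-up scores with letter labels as the index
scores = pd.DataFrame({'Wscore': [81, 95, 63]}, index=['a', 'b', 'c'])

position = scores['Wscore'].argmax()   # integer position of the maximum, for iloc
label = scores['Wscore'].idxmax()      # index label of the maximum, for loc

print(position, label)           # 1 b
print(scores.iloc[[position]])   # row selected by position
print(scores.loc[[label]])       # same row selected by label
```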

In [None]:
# Slicing a dataframe
# Python indexing starts from 0
# In Pandas, we use iloc method

# Syntax: df.iloc[start row : end row, start column : end column]
df.iloc[10:14, 2:5]   # <-- Row 10 up to / not including 14, column 2 up to / not including row 5

#useful for working on a specific part of the dataframe

In [None]:
# save the sliced dataframe as a new dataframe:
# .copy() makes the new dataframe independent, so operations on it do not affect the original

df_s = df.iloc[10:14, 2:5].copy()

df_s

In [None]:
df_t = df.iloc[:, 2:4]    # all rows with columns 2 and 3

df_t
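
The "up to / not including" rule is easiest to verify on a small table where every cell value is known. A sketch, assuming a made-up 6 x 5 grid of the numbers 0 to 29:

```python
import numpy as np
import pandas as pd

# 6 rows x 5 columns filled with 0..29, columns named A..E
grid = pd.DataFrame(np.arange(30).reshape(6, 5), columns=list('ABCDE'))

block = grid.iloc[1:4, 2:5]     # rows 1, 2, 3 and columns 2, 3, 4

print(block.shape)              # (3, 3)
print(block.index.tolist())     # [1, 2, 3]
print(block.columns.tolist())   # ['C', 'D', 'E']
```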

In [None]:
# Conditions for filtering data

df.head()

In [None]:
# What if we want to look at the 'Wscore' values that are greater than the average value?

print('Average Wscore:\n\n', df['Wscore'].mean())

# 77 is used below as the approximate average printed above
print('\nShape of dataframe with values > 77:\n\n', df[df['Wscore']>77].shape)

In [None]:
df[df['Wscore']>77]

In [None]:
df.head()

In [None]:
# How to implement multiple conditions on a DataFrame as a boolean expression:

# Show all rows with an above-average Wscore from seasons after 2011:

df[(df['Season']>2011) & (df['Wscore']>77)]
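
Each condition produces a True/False Series with one entry per row, and the parentheses are needed because `&` binds more tightly than `>` in Python. The sketch below, on made-up rows, stores the conditions in variables and also shows OR (`|`) and NOT (`~`). It computes the average instead of typing 77 by hand.

```python
import pandas as pd

# Made-up games, for illustration only
games = pd.DataFrame({'Season': [2010, 2012, 2013, 2014],
                      'Wscore': [90, 60, 85, 70]})

average = games['Wscore'].mean()             # 76.25
above_average = games['Wscore'] > average    # True/False for every row
recent = games['Season'] > 2011

both = games[recent & above_average]      # AND: both conditions hold
either = games[recent | above_average]    # OR: at least one condition holds
not_recent = games[~recent]               # NOT: flips True and False

print(both)
print(len(either))   # 4
print(not_recent)
```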

## Extracting Rows and Columns
---
Another Method: loc

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

---

In [None]:
# look at the head of only 2 columns across all indices:

df[['Season', 'Wscore']].head()

In [None]:
# using .loc

df.loc[:,['Season', 'Wscore']].head()
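
One difference between the two indexers is worth trying out: a `loc` slice works on labels and includes the end label, while an `iloc` slice works on positions and excludes the end position. A sketch on made-up rows:

```python
import pandas as pd

# Made-up games, for illustration only
games = pd.DataFrame({'Season': [1985, 1985, 1986, 1986],
                      'Daynum': [20, 25, 20, 33],
                      'Wscore': [81, 77, 63, 70]})

by_label = games.loc[0:2, ['Season', 'Wscore']]   # labels 0, 1, 2 (end label included)
by_position = games.iloc[0:2, [0, 2]]             # positions 0, 1 (end position excluded)

print(by_label.shape)      # (3, 2)
print(by_position.shape)   # (2, 2)
```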

In [None]:
# Also see the "groupby" method: .head() on a grouped dataframe returns the first 5 rows of each group

df.groupby(['Season']).head()
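
`groupby` is more often followed by an aggregation, which returns one value per group. A minimal sketch on made-up rows, comparing `.head(1)` with a per-season mean:

```python
import pandas as pd

# Made-up games, for illustration only
games = pd.DataFrame({'Season': [1985, 1985, 1986, 1986, 1986],
                      'Wscore': [80, 70, 60, 90, 78]})

first_rows = games.groupby('Season').head(1)             # first row of each season
mean_scores = games.groupby('Season')['Wscore'].mean()   # one average per season

print(first_rows)
print(mean_scores[1985])   # 75.0
print(mean_scores[1986])   # 76.0
```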