# pandas lesson 2 (DataFrames)

This lesson shows examples of typical operations on a pandas DataFrame, including:
* create a DataFrame using a pandas read_... function
* select a subset of columns
* calculate new columns
* filter rows by values or by index
* sort rows by index or by values

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # pandas uses matplotlib for plotting

A DataFrame is basically a table: a 2D labelled data structure whose columns can have different types.  You can think of it as a dict of Series objects (one per column) that share the same index, if that helps.  Like a Series, a DataFrame has an index, which labels its rows.

We can build a pandas DataFrame in many ways, for example from a dict. The dict's keys become the column names and the dict's values become the column values.
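
As a minimal sketch of the dict approach, the cell below builds a small DataFrame from made-up data. The team names, codes and numbers are hypothetical and are not taken from the EPL file used later.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
data = {
    "Team": ["Alpha FC", "Beta United", "Gamma Town"],
    "Won": [10, 7, 3],
    "Drawn": [2, 5, 4],
}

# The dict keys become the column names; each list becomes a column.
# The optional index argument supplies the row labels.
toy = pd.DataFrame(data, index=["ALP", "BET", "GAM"])

print(toy)
print(toy.columns.tolist())  # column names, taken from the dict keys
print(toy.dtypes)            # each column has its own dtype
```

If no index is passed, pandas falls back to a default integer index starting at 0.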

We can use the pandas read_csv function to load data directly into a DataFrame.  We will use the epl DataFrame in the following sections. It contains the English Premier League football results from the 2023-24 season (as provided by ChatGPT!).

In [None]:
epl_csv_file_url = "https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/EPL%20Results%202023-24.csv"
epl = pd.read_csv(epl_csv_file_url)
epl.head()

Use the three-letter team code as the index.

In [None]:
epl.set_index('Code', inplace=True)
epl.head()

Exercise: Examine the DataFrame.  
Use the following dataframe properties and methods: index, head(), describe(), shape, values, columns.

In [None]:
# Write your code here as a set of print statements. The first one is provided.
print("Index:\n", epl.index)

### Access columns in a DataFrame

We can access a column or columns (col1, col2) in a DataFrame df in several ways: 
* df['col1'], which returns a Series
* df.col1, which returns a Series (pandas lets us refer to a column like a property, but only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute)
* df[['col1']] or df[['col1', 'col2']], which return DataFrames
* df.loc[:, ['col1', 'col2']], which returns a DataFrame and is possibly the best practice and the most flexible
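
The options above can be compared side by side. This is a small sketch on a hypothetical DataFrame (df, col1, col2, col3 and clash are illustrative names), printing the type each form returns.

```python
import pandas as pd

# Hypothetical DataFrame used only to compare the access styles
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})

s1 = df["col1"]                    # single label in brackets
s2 = df.col1                       # attribute (dot) access
d1 = df[["col1"]]                  # list of labels, even with one entry
d2 = df.loc[:, ["col1", "col2"]]   # all rows, selected columns

print(type(s1).__name__)  # Series
print(type(s2).__name__)  # Series
print(type(d1).__name__)  # DataFrame
print(type(d2).__name__)  # DataFrame

# Dot access cannot reach a column whose name clashes with a DataFrame
# method: here clash.count is the count() method, not the column.
clash = pd.DataFrame({"count": [1, 2]})
print(callable(clash.count))           # True
print(type(clash["count"]).__name__)   # Series
```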

Task: experiment with various ways of selecting columns of the epl DataFrame.

In [None]:
# Here are some examples
print("Using dot notation:\n", epl.Team.head(2))
print("Using [] notation:\n", epl['Team'].head(2))
# Write your code here


### Sort rows
We can sort by values or sort by index.  Sort by values is the more usual case.  
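
As a rough sketch of both kinds of sort, the cell below uses a small hypothetical DataFrame (the index labels and numbers are made up). It also shows that ascending accepts one flag per sort column, and that both methods return a new DataFrame by default rather than changing the original.

```python
import pandas as pd

# Hypothetical toy data with deliberate ties in the Won column
toy = pd.DataFrame(
    {"Won": [5, 8, 5, 8], "Drawn": [3, 1, 6, 4]},
    index=["DDD", "BBB", "AAA", "CCC"],
)

# Sort by Won (high to low); break ties by Drawn (low to high)
by_values = toy.sort_values(by=["Won", "Drawn"], ascending=[False, True])
print(by_values.index.tolist())  # ['BBB', 'CCC', 'DDD', 'AAA']

# Sort by the row labels instead
by_index = toy.sort_index()
print(by_index.index.tolist())   # ['AAA', 'BBB', 'CCC', 'DDD']

# The original DataFrame keeps its order
print(toy.index.tolist())        # ['DDD', 'BBB', 'AAA', 'CCC']
```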

Order the rows in the DataFrame by the number of games won (low to high), then, in the case of ties, by the number of games drawn.

In [None]:
# sort_values returns a new, sorted DataFrame; epl itself is not changed
epl.sort_values(by=['Won', 'Drawn'], ascending=True)

Task: sort rows by the GF and GA columns (highest first).

In [None]:
# Write your code here

We can also sort a DataFrame by its index.

In [None]:
epl_sorted_by_index = epl.sort_index()
epl_sorted_by_index.head()

### Create new columns

In [None]:
# Teams get 3 points for a win, 1 for a draw, none for a loss
epl['Points'] = epl['Won'] * 3 + epl['Drawn']
epl.head()

Task: Create a new column, Played, to hold the number of games played by each team.


In [None]:
# Write your code here

Task: Create a new column, GD, to hold the goal difference, calculated as GF - GA (Goals For - Goals Against)

In [None]:
# Write your code here

### Filter rows
Select rows and columns of the DataFrame by label in various ways, using the loc indexer with the index and column names.

In [None]:
epl.loc['LIV', :] # one row, all columns, returns a Series

In [None]:
epl.loc[['LIV'], :] # one row, all columns, returns a DataFrame

In [None]:
epl.loc[['LIV', 'MCY'], :] # two rows, all columns, returns a DataFrame

In [None]:
epl.loc[['LIV', 'MCY'], ['Team', 'Won']] # two rows, two columns

### Filter rows by values

We have already seen that a comparison on a column returns a Series of bools.

In [None]:
epl.Won >= 25


We can use this boolean Series as a mask to keep only the rows of the DataFrame where the value is True.

In [None]:
epl.loc[epl.Won >= 25, :]

Task: filter the epl DataFrame to return only teams with 80 points or more

In [None]:
# Write your code here


Filter for teams that have Manchester in their name.

In [None]:
epl.loc[epl.Team.str.contains('Manchester'), :]

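We can combine two filter criteria with & (and). A comparison such as epl.Won > 25 must be wrapped in parentheses, because & binds more tightly than comparison operators.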

In [None]:
epl.Team.str.contains('Manchester') & (epl.Won > 25)
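
The expression above only produces the combined mask. As a hedged sketch on hypothetical data (the team names, codes and win counts are made up), the cell below passes combined masks to loc, and also shows | (or) and ~ (not).

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "Team": ["North City", "North United", "South Rovers"],
        "Won": [28, 18, 9],
    },
    index=["NCY", "NUN", "SRO"],
)

# Naming each mask keeps the combined expression readable
is_north = toy["Team"].str.contains("North")
many_wins = toy["Won"] > 25

both = toy.loc[is_north & many_wins, :]     # and: both conditions hold
either = toy.loc[is_north | many_wins, :]   # or: at least one holds
not_north = toy.loc[~is_north, :]           # not: invert the mask

print(both.index.tolist())       # ['NCY']
print(either.index.tolist())     # ['NCY', 'NUN']
print(not_north.index.tolist())  # ['SRO']
```

Python's own and, or and not keywords do not work element by element on a Series, which is why the &, | and ~ operators are used here.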

Task: filter the epl DataFrame to return only London teams that have won more than 5 games

In [None]:
# Write your code here