# pandas lesson 

Objective: Wrangle data with the pandas package. Create a dataset, add new columns, select columns, filter rows, and group and aggregate.

 A pandas DataFrame is a 2D labelled object (a table).
 A pandas Series is a 1D labelled object.
 pandas rows carry index labels (a strange concept to SQL people); the labels are usually unique, though pandas does not require it.
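
To make these statements concrete, here is a minimal sketch with made-up data (the names `df`, `col` and `dup` are illustrative only). It also shows that row labels are normally unique, but pandas only reports duplicates when asked.

```python
import pandas as pd

# A small DataFrame (2D) with explicit row labels; the data is made up
df = pd.DataFrame({"price": [1.5, 2.0, 3.25]}, index=["a", "b", "c"])
print(df.ndim)   # 2

# Selecting one column gives a Series (1D) that shares the row labels
col = df["price"]
print(col.ndim)  # 1
print(list(col.index))

# Duplicate labels are allowed; is_unique reports whether any exist
dup = pd.Series([10, 20, 30], index=["x", "x", "y"])
print(dup.index.is_unique)
print(dup["x"].tolist())  # selecting a duplicated label returns every match
```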

 This lesson shows examples of typical operations on a pandas dataframe including:
* calculating new columns
* filtering rows in various ways
* grouping and summarising 
* using the apply function and lambda functions
* creating simple plots

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # pandas uses matplotlib for plotting

In [None]:
#  Let's have a quick look at a series
p = pd.Series(np.random.randn(4), name = 'price')
p, p.index, p.values, p.dtype, p.shape, p.ndim, p.size

In [None]:
p * 100 # multiply all values by 100, an element-wise operation

In [None]:
np.abs(p) # absolute value of all elements

In [None]:
p.mean(), p.std(), p.min(), p.max(), p.median(), p.quantile(0.25), p.quantile(0.75) # some summary statistics

In [None]:
p.index = ["HSBC", "BP", "TSCO", "RDSA"] 
#p.describe()
p

In [None]:
p['BP'] # get a value in the Series


In [None]:
p['BP'] = 5.0 # set a value in the Series
p['BP']

In [None]:
'TSCO' in p # is this label in the index of the Series? (`in` checks labels, not values)
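
Note that `in` applied to a Series tests the index labels, not the stored values. A minimal sketch, using a made-up Series `s`, of both kinds of membership test:

```python
import pandas as pd

s = pd.Series([1.0, 5.0, 2.5], index=["HSBC", "BP", "TSCO"], name="price")

label_found = "TSCO" in s               # 'TSCO' is an index label
value_as_label = 5.0 in s               # 5.0 is a value, not a label
value_found = 5.0 in s.values           # search the underlying array instead
value_found_isin = s.isin([5.0]).any()  # the same test using pandas

print(label_found, value_as_label, value_found, value_found_isin)
```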

In [None]:
#  We can build a pandas dataframe from a dict
fb_dict = {
        'id' : ['MCY', 'LIV', 'TOT', 'CHE', 'ARL'],
        'city' :	['Manchester',	'Liverpool', 'London', 'London', 'London'],
        'team' :	['Manchester City', 'Liverpool', 'Tottenham Hotspur', 'Chelsea', 'Arsenal'],
        'won' :	[5, 6, 6, 5, 5],
        'drawn' : [4, 1, 0, 2,0],
        'lost' : [0, 0, 2, 0, 2],
        'form' : ['DWWWW', 'WWWWD', 'LLWWW', 'WWWDD', 'WWWWW']
        }
#  Create a pandas dataframe
# The dict's keys become the column names and the dict's values become the column values
fb_df = pd.DataFrame(fb_dict)

# set the index to the unique values of the 'id' column - more useful than 0,1,2...
fb = fb_df.set_index('id')
fb


In [None]:
fb.index

In [None]:
fb.head(2), fb.tail(2)


In [None]:
fb.describe() # summary statistics of the numeric columns

In [None]:
fb.shape

In [None]:
fb.values # a numpy array
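
When a dataframe mixes text and numbers, the resulting array has to use the general `object` dtype. A sketch on a made-up frame, assuming you want a numeric array and therefore select the numeric columns first (`to_numpy()` is the method form of `.values`):

```python
import pandas as pd

df = pd.DataFrame({"team": ["Chelsea", "Arsenal"], "won": [5, 5], "lost": [0, 2]})

mixed = df.to_numpy()                      # strings and integers together
numeric = df[["won", "lost"]].to_numpy()   # integers only

print(mixed.dtype)    # object
print(numeric.dtype)  # an integer dtype
print(numeric.shape)
```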

In [None]:
# fb[['city']]
fb.city # values of a column.  Note the  dot syntax
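
Dot syntax is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. A minimal sketch with a made-up frame `df`:

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Leeds"],
                   "count": [3, 1],
                   "goal diff": [4, -2]})

print(df.city.tolist())          # dot syntax works for 'city'
print(callable(df.count))        # True: df.count is the DataFrame method, not the column
print(df["count"].tolist())      # bracket syntax always reaches the column
print(df["goal diff"].tolist())  # names containing spaces need brackets too
```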


In [None]:
fb.city.unique()


In [None]:

fb.columns #  The column labels, returned as a pandas Index

In [None]:
# Create new columns
# Teams get 3 points for a win, 1 for a draw, none for a loss
fb['points'] = fb['won'] * 3 + fb['drawn']
fb


### Create new columns

**Exercise:** Create a new column, played, to hold the number of games played by each team.


In [None]:
# Write your code here

Show a column chart of the points for each team.

In [None]:
fb['points'].plot(kind = 'bar')
plt.show()

### Sort rows
Order the rows in the dataframe by the points scored (low to high), then in the case of any ties, by the number of games won.

In [None]:
fb_sorted = fb.sort_values(by = ['points', 'won'], ascending=True)
# Show the sorted dataframe in a column chart
fb_sorted['points'].plot(kind = 'bar')
plt.show()
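
League tables usually list the highest points first. A sketch, on a made-up frame, of passing a list to `ascending` so that each sort key gets its own direction (the team 'XYZ' is hypothetical and only there to force a tie):

```python
import pandas as pd

table = pd.DataFrame(
    {"points": [15, 19, 18, 19], "won": [5, 6, 6, 5]},
    index=["ARL", "LIV", "TOT", "XYZ"],  # 'XYZ' is a made-up team
)

# points high to low; ties broken by games won, also high to low
ranked = table.sort_values(by=["points", "won"], ascending=[False, False])
print(ranked.index.tolist())
```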

Sort fb by index


In [None]:
fb = fb.sort_index(ascending=True)
fb

Add a new column, team_caps. Note how this is done: the argument passed to apply is the string method str.upper, which apply then calls on each value in the column.

In [None]:
fb['team_caps'] = fb['team'].apply(str.upper)
fb
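
`apply` accepts any callable that takes one value, so a lambda works as well as a named function. A sketch on a made-up Series `team`, alongside the vectorised `.str` equivalent:

```python
import pandas as pd

team = pd.Series(["Chelsea", "Arsenal"], index=["CHE", "ARL"])

upper_named = team.apply(str.upper)                    # pass an existing function
upper_lambda = team.apply(lambda name: name.upper())   # pass a lambda
upper_str = team.str.upper()                           # vectorised string accessor

# A lambda is handy when no ready-made function exists:
# first letter of the name followed by the name's length
initial_and_len = team.apply(lambda name: f"{name[0]}{len(name)}")
print(initial_and_len.tolist())
```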

The next two statements have the same effect.  They both create a column holding the first five characters of the team name.  Which do you prefer?

In [None]:
# Use slicing
fb['team_short'] = fb['team'].str[:5]
fb

In [None]:
# Use a list comprehension
fb['team_short2'] = [x[0:5] for x in fb['team']]  # one value per row; slicing leaves short names unchanged
fb
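
One difference between the two approaches: a list assigned to a column must contain exactly one element per row, so a comprehension that filters elements can fail, whereas `.str[:5]` simply returns short strings unchanged. A sketch with a made-up frame that includes a short name:

```python
import pandas as pd

df = pd.DataFrame({"team": ["Chelsea", "Bury", "Arsenal"]})

# Slicing copes with names shorter than five characters
df["short"] = df["team"].str[:5]
print(df["short"].tolist())

# A filtered comprehension yields only two values for three rows
filtered = [x[:5] for x in df["team"] if len(x) > 5]
try:
    df["short2"] = filtered
    mismatch = False
except ValueError:
    mismatch = True  # pandas rejects the length mismatch
print(mismatch)
```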

In [None]:
# To keep the dataframe tidy, drop the new columns we added
fb = fb.drop(['team_caps', 'team_short', 'team_short2'], axis=1)

### Grouping

Group by city to get the totals of the numeric columns for each city.

In [None]:
fb_by_city = fb.groupby(['city'], as_index = False)
fb_by_city.sum(numeric_only=True)
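
`sum` is only one of the available aggregations. A sketch, on a made-up frame, of named aggregation with `agg`, which computes a different statistic for each output column (the output names `teams`, `total_won` and `max_points` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Manchester", "London", "London"],
    "won": [5, 6, 5],
    "points": [19, 18, 17],
})

# Each keyword is an output column: (input column, aggregation)
summary = df.groupby("city", as_index=False).agg(
    teams=("won", "count"),
    total_won=("won", "sum"),
    max_points=("points", "max"),
)
print(summary)
```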


### Filter rows
Filter both rows and columns of the dataframe in various ways using the loc indexer. The general form is `fb.loc[rows, columns]`, where `:` means "all".

In [None]:
fb.loc[:, ['city']] # all rows, one column

In [None]:
fb.loc['LIV', :] # one row, all columns, returns a series

In [None]:
fb.loc[['LIV'], :] # one row, all columns, returns a dataframe


In [None]:
fb.loc[['LIV', 'MCY'], :] #   two rows, all columns, returns a dataframe

In [None]:
fb.loc[['LIV', 'MCY'], ['team', 'won']] # two rows, two columns

In [None]:
fb.loc[:, ['team', 'won']] # all rows, two columns

In [None]:
# Returns a Series. Note the use of the <dataframe>.<column> notation
fb.won

In [None]:
# returns a Series of bools
fb.won > 5


In [None]:
# filters the rows of the dataframe where the bools are True
fb.loc[fb.won > 5, :]

In [None]:
# filters the rows of the dataframe where the city is London
fb.loc[fb.city == 'London', :]

In [None]:
# Combine two filter criteria with an & (and)
(fb.city == 'London') & (fb.won > 5)


In [None]:
# return rows of London teams that have won more than 5 games
fb.loc[(fb.city == 'London') & (fb.won > 5), :]
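
Criteria can also be combined with `|` (or) and negated with `~` (not), and `isin` tests membership in a list. A sketch on a made-up frame; as with `&`, each comparison needs its own parentheses:

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Manchester", "Liverpool", "London", "London"],
     "won": [5, 6, 6, 5]},
    index=["MCY", "LIV", "TOT", "CHE"],
)

either = df.loc[(df.city == "London") | (df.won > 5), :]            # London OR more than 5 wins
not_london = df.loc[~(df.city == "London"), :]                      # NOT London
north_west = df.loc[df.city.isin(["Manchester", "Liverpool"]), :]   # city is in the list

print(either.index.tolist())
print(not_london.index.tolist())
print(north_west.index.tolist())
```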