# PANDAS

With Pandas we will work more and more with real data. Many concepts will be familiar to you from Excel and other data-browsing applications.

Note: This badge is very long, but opens the door to the world of data science in Python.

In this notebook:

Pandas
- DataFrame = Dict + Excel

- accessing, editing and sorting DataFrames (.loc)

- changing with functions: mapping and applying

- creating new columns with assign, replace, drop

- grouping, indexes and renaming

- mini-diary ⭐️⭐️⭐️❓

### DataFrame - basically like a more powerful Dict + Excel 

- Keys are column names
- Values as Arrays/Lists with rows

**DataFrame is to a Dict what Array was to a List**

To create a DataFrame, we pass a Dict into its constructor. Remember that the values need to be lists. Like this:

```
data = pd.DataFrame({'names': ['Judy', 'Kim', 'Shaz'], 'year': [1,1,2] })
```
   
You can access DataFrame columns like you would attributes of an object:

```data.names``` or ```data['names']```
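
Here is a minimal sketch checking that both access styles give back the same column. The column name `study year` is made up for illustration; it shows a case where only the bracket style works, because the name contains a space.

```python
import pandas as pd

data = pd.DataFrame({'names': ['Judy', 'Kim', 'Shaz'], 'year': [1, 1, 2]})

# both styles return the same column (a pandas Series)
by_attribute = data.names
by_key = data['names']
print(by_attribute.equals(by_key))

# bracket notation is the only option when a column name contains a space
# ('study year' is a hypothetical column name, used only for this example)
data['study year'] = [1, 1, 2]
print(data['study year'].tolist())
```

Dot notation is shorter to type, but bracket notation always works, so it is the safer habit when column names come from real-world files.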


### DataFrame Subsets - get some Rows, get some Columns

In [None]:
# before we use them, we need to import NumPy and Pandas once per notebook (importing them again does no harm)

import numpy as np
import pandas as pd

In [None]:
data = pd.DataFrame({'names': ['Judy', 'Kim', 'Shaz', 'Sun'],
                     'study_year': [1,1,2,1], 
                     'avg_grades': [72,61,68,65] })
print(data) 
# notice the constructor function pd.DataFrame(...) basically takes a normal Dictionary!

In [None]:
# Row subsets work like slices of Arrays: dataframe[start : ceiling : jump]
print(data[1:3]) 

In [None]:
print(data[0:3:2])  # jump every 2

In [None]:
# Get a column. The old way.
# Notice meta information about names and data types!
print(data['names'])

In [None]:
# but quite frequently you would use a . dot notation, like this:
print(data.names)

In [None]:
# get individual items
print(data.names[0])
print(data.names[2])

In [None]:
# get many individual items
print(data.names[1:3])

### Modifying DataFrames

In [None]:
# change values: the old ('deprecated') way

data = pd.DataFrame({'names': ['Judy', 'Kim', 'Shaz', 'Sun'],
                     'study_year': [1,1,2,1], 
                     'avg_grades': [72,61,68,65] })

data.names[0] = "Natasha" # note: this way is likely to raise a warning, and in recent pandas versions it may not change data at all
print(data)

In [None]:
# instead, use the .at or .loc accessors. Both can set a single value, but .loc is more flexible

data = pd.DataFrame({'names': ['Judy', 'Kim', 'Shaz', 'Sun'],
                     'study_year': [1,1,2,1], 
                     'avg_grades': [72,61,68,65] })

data.loc[0, 'names'] = "Natasha" # notice the order of arguments
data.loc[2, 'avg_grades'] = 100
print(data)
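
To make the difference between the two a bit more concrete, here is a small sketch (same example data as above): `.at` touches exactly one cell, while `.loc` can also take a slice or list of row labels.

```python
import pandas as pd

data = pd.DataFrame({'names': ['Judy', 'Kim', 'Shaz', 'Sun'],
                     'study_year': [1, 1, 2, 1],
                     'avg_grades': [72, 61, 68, 65]})

# .at reads or writes exactly one cell: [row label, column name]
data.at[1, 'avg_grades'] = 75

# .loc also accepts slices and lists of labels, so it can change several cells at once
# note: label slices in .loc include BOTH ends (rows 2 and 3 here)
data.loc[2:3, 'study_year'] = 3

print(data)
```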

In [None]:
# Adding a new column to a DataFrame works just like adding a new key-value pair to a Dict

data = pd.DataFrame({'names': ['Judy', 'Kim', 'Shaz', 'Sun'],
                     'study_year': [1,1,2,1], 
                     'avg_grades': [72,61,68,65] })
data['department'] = ['business', 'math', 'business', 'medicine']
print(data)

In [None]:
# modify all items in a column
data = pd.DataFrame({'names': ['Judy', 'Kim', 'Shaz', 'Sun'],
                     'study_year': [1,1,2,1], 
                     'avg_grades': [72,61,68,65] })
data['study_year'] += 1 # same as data['study_year']  = data['study_year']  + 1
data['avg_grades'] = 0
print(data)
data

### **Side note about print vs implied return:** in Python notebooks there are two ways to 'show' what's going on with code

This is a notebook-specific feature that we sort of ignored until now, but now it is becoming important:

**print()** - 'show' many things, by 'printing/outputting' them below the cell. 

- You can print many things, and they will be shown in the order you printed them (exactly like printing on paper)

**'implied output/return'** - you know return from functions. One thing can get returned from each function, and then the function 'ends/terminates'. In other words: the return is always the last thing that happens in the function. Notebook cells work similarly, but without the word 'return': the value of the last expression in the cell will be shown underneath it in an Out[] block.

- Only one thing is in the 'Out[]' block. It is the value of the last line in your code. 
- Some functions do not return anything (e.g. print(...) does not). An assignment with = does not return anything either.
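
A quick way to convince yourself that `print()` shows something but returns nothing is to catch its return value. This is only a small sketch in plain Python; the variable names are made up for the example.

```python
# print() shows text as a side effect, but the value it returns is None
result = print("hello")
print(result is None)

# an expression such as a comparison does produce a value
comparison = (3 > 2)
print(comparison)
```

This is why a cell ending in `print(...)` has no Out[] block, while a cell ending in a comparison or a variable name does.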

In [None]:
# all prints are visible, but only the result of the last line is 'Out[]'ed
print(3)
print(4)
print(5)
100
200
300

In [None]:
name = 'Shaz'
name

In [None]:
# assignment does not return anything, so nothing is 'Out[]'ed
name = 'Shaz'

In [None]:
# printing does not return anything, so nothing is 'Out[]'ed
print(name)

In [None]:
# but for example comparison does return a True/False value
name == "Banana"

In [None]:
# and here we will both print and return a value. Because why not!
print(name)
name

**Back to the topic...**

### Showing/printing DataFrames

You can ```print()``` DataFrames and they will be arranged into a nice readable format, but you can also return them (make them the last item in your Notebook cell) and they will be displayed in an even nicer format.

In [None]:
data = pd.DataFrame({'name' : ['Judy', 'Kim', 'Shaz', 'Natt', 'Gill'],
                   'surname' : ['OBrien', 'Gunn', 'Dice', 'Johnes', 'Roy'],
                   'semester' : [1,1,2,2,1],
                   'score' : [3.7, 4.6, 8.2, 2.6, 3.7],
                   'penalty' : [0.5, 0.0, 0.8, 0.0, 0.2]})

# when printed in a cell it is simple
print("printed version:\n", data)

# when returned from a cell it is prettier
print("'returned' version") # remember: in Python notebooks, the last line of code gets 'returned' and 'interpreted'
data 

In [None]:
# you can produce some simple statistics about numeric values in your data with describe() 

data.describe()

In [None]:
# to get information about data types and sizes of data you can use info()

data.info()

### Sorting Dataframes

sort_values takes a number of arguments:

- ```by``` is a List of columns to sort the data by, in order. First items are sorted by first item in this list. If there is a tie, they are sorted by second item, etc.
- ```inplace``` is a True/False value indicating whether the sorted result should be returned as a new DataFrame (False), or written back into the original DataFrame (True). Use ```inplace=False``` if you want to print or output data, and use ```inplace=True``` if you want to change your actual data.
- ```ascending``` takes either True/False value, or a list of True/False values (if sorting by many columns)

In [None]:
data.sort_values(by=['semester','score'],ascending=True,inplace=False)

In [None]:
data.sort_values(by=['semester','score'],ascending=[False, True],inplace=False)

In [None]:
data.sort_values(by=['score', 'penalty'],ascending=True,inplace=False)

In [None]:
# this will sort the 'data' variable, but not actually return it
data.sort_values(by=['surname'],ascending=True,inplace=True)

In [None]:
# but look, it was actually sorted by surname!
print(data)

### Removing Duplicates with drop_duplicates()

Before we learn how to remove duplicates, I will show you how to create some:

In [None]:
# multiplying items combines them. Multiplying item by x is like adding it to itself x many times

print(3 * 3)
print('3' * 3)
print([3] * 3)

In [None]:
# this can be combined with actual adding
print([1,1,1] + [2,2])
print(['sales'] * 3 + ['marketing']*5)

Let's make some duplicates and remove them:

In [None]:
door_signs = pd.DataFrame({'department':['sales'] * 3 + ['marketing']*5 + ['r&d'] * 4,
                        'floor':[1,2,3,1,2,3,3,3,1,1,1,2]})
door_signs

In [None]:
door_signs.sort_values(by='floor')

In [None]:
door_signs.drop_duplicates()

You can also drop duplicates based on only one column, but keep in mind that the order of the rows then becomes very meaningful: by default, the first row found for each value is the one that stays. Have a look at the examples below.

In [None]:
door_signs = pd.DataFrame({'department':['sales'] * 3 + ['marketing']*5 + ['r&d'] * 4,
                        'floor':[1,2,3,1,2,3,3,3,1,1,1,2]})
door_signs.drop_duplicates(subset='department')

In [None]:
door_signs = pd.DataFrame({'department':['sales'] * 3 + ['marketing']*5 + ['r&d'] * 4,
                        'floor':[1,2,3,1,2,3,3,3,1,1,1,2]})
door_signs.drop_duplicates(subset='floor')
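
To see why row order matters, here is a sketch using the `keep` argument of `drop_duplicates` on the same `door_signs` data. The variable names `first_rows` and `last_rows` are just for illustration.

```python
import pandas as pd

door_signs = pd.DataFrame({'department': ['sales'] * 3 + ['marketing'] * 5 + ['r&d'] * 4,
                           'floor': [1, 2, 3, 1, 2, 3, 3, 3, 1, 1, 1, 2]})

# keep='first' (the default) keeps the first row seen for each department
first_rows = door_signs.drop_duplicates(subset='department', keep='first')

# keep='last' keeps the last row seen instead, so different floors survive
last_rows = door_signs.drop_duplicates(subset='department', keep='last')

print(first_rows)
print(last_rows)
```

If you need a particular row to survive, sort the data first (for example with `sort_values`) and then drop the duplicates.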

### Selecting only some items (you might have seen this before)

In [None]:
door_signs = pd.DataFrame({'department':['sales'] * 3 + ['marketing']*5 + ['r&d'] * 4,
                        'floor':[1,2,3,1,2,3,3,3,1,1,1,2]})

# pick only those on floor 1
door_signs.loc[door_signs['floor'] == 1]

In [None]:
# or pick only those above floor 1
door_signs.loc[door_signs['floor'] > 1]

In [None]:
# or matching any value from a list
door_signs.loc[door_signs['department'].isin(['sales','marketing'])]

In [None]:
# or use many conditions with &
door_signs.loc[ (door_signs['department'].isin(['sales','marketing'])) & (door_signs['floor'] > 1) ]
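
Besides `&` (AND), conditions can also be combined with `|` (OR) and flipped with `~` (NOT). A small sketch on the same data, with made-up variable names:

```python
import pandas as pd

door_signs = pd.DataFrame({'department': ['sales'] * 3 + ['marketing'] * 5 + ['r&d'] * 4,
                           'floor': [1, 2, 3, 1, 2, 3, 3, 3, 1, 1, 1, 2]})

# | means OR: every sales row, plus any row on floor 3
sales_or_top = door_signs.loc[(door_signs['department'] == 'sales') | (door_signs['floor'] == 3)]

# ~ means NOT: everything except marketing
not_marketing = door_signs.loc[~door_signs['department'].isin(['marketing'])]

print(sales_or_top)
print(not_marketing)
```

Remember the round brackets around each condition; without them Python evaluates `&` and `|` before the comparisons and the cell fails.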

### Mapping columns values with .map( ) 

Previously, when we used map, it was a built-in Python function, into which we had to pass the list we wanted to map.

```map(mapping_function, my_list)```

That was a bit confusing, because methods usually are called on objects. It would make more sense for map to work like this:

```my_list.map(mapping_function)```

And Pandas gives us the ability to do exactly that! Well... on DataFrame columns (Series), not Lists, but that's close enough.

```my_data_frame['column_name'].map(mapping_function)```

**YOU CAN CHAIN .MAP( )**, which makes certain tasks much simpler, as in:

```name.map(mapping_function_1).map(mapping_function_2)```


In [None]:
data = pd.DataFrame({'fruits' : ["banana",  "kiwi", "apple"]})

data['lengths'] = data['fruits'].map(len)

data

### But first... a short catchup on lambda functions

In [None]:
# traditional function definition looks like this:
data = pd.DataFrame({'fruits' : ["banana",  "kiwi", "apple"]})

def first_letter(word):
    return word[0]

print(first_letter("banana"))


# and then you can use this function's NAME in a map

data['first_letters'] = data['fruits'].map(first_letter)

data

Quick recap about lambda functions:

In [None]:
# 'lambda function' is python's simplified syntax for defining functions.
# they do not really have a name, but rather you can put function definition right into your code
# function definition looks like this: 
# lambda inputs: outputs

data = pd.DataFrame({'fruits' : ["banana",  "kiwi", "apple"]})

first_letter = lambda word: word[0] # this replaces function definition. But is a bit harder to read
# you sort of put a function in a variable. It can still be used like a function:
print(first_letter("banana"))


data['first_letters'] = data['fruits'].map(first_letter)

data

In [None]:
# but most commonly lambdas are used in situations where calculation is quick and simple
# and there is no need to give it a special name, or to reuse it later 

data = pd.DataFrame({'fruits' : ["banana",  "kiwi", "apple"]})

data['first_letters'] = data['fruits'].map(lambda word: word[0])

data
# this is THE MOST COMMON SCENARIO

In [None]:
# another example
offices = pd.DataFrame({'department':['sales'] * 3 + ['marketing']*5 + ['r&d'] * 4,
                        'floor':[1,2,3,1,2,3,3,3,1,1,1,2]})
offices['floor_signs'] = offices['floor'].map(lambda floor: f"Floor {floor}")
offices

In [None]:
offices = pd.DataFrame({'department':['sales'] * 3 + ['marketing']*5 + ['r&d'] * 4, 
                        'floor':[1,2,3,1,2,3,3,3,1,1,1,2]})

floor_names = {1: 'Ground Floor', 2: 'Main Floor', 3: "Roof Floor"}

offices['floor_signs'] = offices['floor'].map(lambda floor: floor_names[floor] )
offices
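
For lookups like the one above, `.map( )` can also take the Dict itself instead of a lambda. The sketch below uses a shortened version of the data, with an extra floor 4 added on purpose to show what happens to values missing from the Dict.

```python
import pandas as pd

# shortened example data; floor 4 is deliberately missing from floor_names
offices = pd.DataFrame({'department': ['sales', 'marketing', 'r&d', 'r&d'],
                        'floor': [1, 2, 3, 4]})

floor_names = {1: 'Ground Floor', 2: 'Main Floor', 3: 'Roof Floor'}

# passing the dict itself: each value is looked up as a key,
# and values that are not in the dict become NaN
offices['floor_signs'] = offices['floor'].map(floor_names)
print(offices)
```

The lambda version `floor_names[floor]` would stop with a KeyError on floor 4, so which one to use depends on whether you want missing values to fail loudly or quietly.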


### Cleaning up data with .map( )

Here's an **example of chaining** ```.map( )``` to clean up data:

In [None]:
# data is so messy! Read it. Do you see any patterns? How could it be fixed?

offices = pd.DataFrame({
    'department': ['-Sales-', 'sales  ', '-SALES-', 'marketing', 'MARKETING', '-marketing-',
                   'Marketing', '  marketing', '-R&D-', ' r&d', 'r&d  ', 'r&d'],
    'floor':[1,2,3,1,2,3,3,3,1,1,1,2]
})
offices

In [None]:
department_codes = {'sales': "SAL", 'marketing': "MAR", 'r&d': "RND" }

# Let's create lambdas which will clean it up, one source of messiness at a time:
remove_space_and_dash = lambda word: word.strip().strip('-')
to_lower_case = lambda word: word.lower()
dept_name_to_code = lambda dept_name: department_codes[dept_name]

# mapping - add new column, with cleaned up values
offices['dept_code'] = offices['department'].map(remove_space_and_dash).map(to_lower_case).map(dept_name_to_code)
offices

### DataFrame and Higher Order Functions: .map( ) and .applymap( ) and  .apply( ) 

All three of these do something very similar. In the simplest terms:

- **map( )** works on the items of one column, one at a time. Use it to turn each item into another item.
- **applymap( )** works on the whole DataFrame (all columns). Like map( ), but for every column at once. Note: pandas 2.1 and later deprecate applymap( ) in favour of a DataFrame-level map( ) that does the same job.
- **apply( )** accesses each row or each column as a whole, and represents it as one value (reduces it to one value).

In [None]:
# here's a silly example:
# map() changes column 

people = pd.DataFrame({
    'firstname': ['Jill', 'Ryu', 'Rashni'],
    'lastname':['McDoughal', 'Kawasaki', 'Ng']
})

to_upper_case = lambda word: word.upper()
people['firstname'] = people['firstname'].map(to_upper_case)
people

In [None]:
# here's a silly example:
# applymap() changes all columns 

people = pd.DataFrame({
    'firstname': ['Jill', 'Ryu', 'Rashni'],
    'lastname':['McDoughal', 'Kawasaki', 'Ng']
})

to_upper_case = lambda word: word.upper()
people = people.applymap(to_upper_case)
# this could also be written with lambda right in the map:
# people = people.applymap(lambda word: word.upper())

people

In [None]:
# here's a silly example:
# apply() has access to all columns of each row
# note: with axis='columns' the function takes one whole row as input

people = pd.DataFrame({
    'firstname': ['Jill', 'Ryu', 'Rashni'],
    'lastname':['McDoughal', 'Kawasaki', 'Ng']
})

join_names = lambda row: f"{row['firstname']} {row['lastname']}"

people['fullname'] = people.apply(join_names, axis='columns')
people

And another set of examples:

In [None]:
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs']
})

def shortened(word):
    if len(word) >= 10:
        return f"{word[0:7]}..."
    else:
        return word
    
    
# note, above function could also be written with a ternary operator:
# shortened = lambda word: word if len(word) < 10 else f"{word[0:7]}..."


# map is used on a column
foods['label'] = foods['name'].map(shortened) # just one column
foods

In [None]:
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs']
})


new_foods = foods.applymap(to_upper_case) # applymap is used on a whole dataframe
new_foods

In [None]:
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20]
})


string_from_row = lambda row: f"{row['name']} from {row['supplier']} suits {row['diet']} diet"

# apply is used to create a new column, but has access to all columns 
foods['label'] = foods.apply(string_from_row, axis='columns')
foods

### But what does the 'axis=' do in apply function?

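It describes the direction in which the function is applied: with axis=0 (or 'index') the function receives each column in turn, and with axis=1 (or 'columns') it receives each row in turn.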

In [None]:
numbers = pd.DataFrame([[11,22,33]] *5, columns=['ones','twos','threes'])
numbers

In [None]:
numbers.apply(np.sum, axis = 0) # axis 0 means columns: this returns one sum per column, as a Series

In [None]:
numbers.apply(np.sum, axis = 1) # axis 1 means rows: this returns one sum per row, as a Series
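
The same `axis` argument appears in many other DataFrame methods. As a sketch, here is the built-in `.sum( )` method (swapped in for `apply(np.sum, ...)`) on the same `numbers` data; the variable names are made up for the example.

```python
import pandas as pd

numbers = pd.DataFrame([[11, 22, 33]] * 5, columns=['ones', 'twos', 'threes'])

# axis=0: work down each column, giving one total per column
column_totals = numbers.sum(axis=0)

# axis=1: work across each row, giving one total per row
row_totals = numbers.sum(axis=1)

print(column_totals)
print(row_totals)
```

There are 3 columns and 5 rows, so `column_totals` has 3 items and `row_totals` has 5.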

# More of data cleaning and preparation:

### Calculate a new column from rows with ```.assign( )```

In [None]:
import numpy as np
import pandas as pd

In [None]:
# "Returns a new object with all original columns in addition to new ones. 
# Existing columns that are re-assigned will be overwritten."

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20]
})

foods = foods.assign(student_price = foods['price']*0.9, available = True)
foods

This one is special :D For a change, **assign( ) does not change the original dataframe**; it returns a new one. There is no inplace option: assign( ) treats every keyword argument as a new column, so if you specified 'inplace=True' it would just add a column called 'inplace' and put the value True in every row of that column. It's just a peculiar price we need to pay for the power of assign.

In [None]:
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20]
})

foods.assign(student_price = foods['price']*0.9, inplace=True) # this will go wrong: the result is not stored, so foods stays unchanged
foods

In [None]:
foods = foods.assign(student_price = foods['price']*0.9, inplace=True) 
# let's try again, and catch the result. inplace does something unexpected: it becomes a column
foods
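
`assign( )` also accepts a function (for example a lambda) as the value of a new column; the function receives the DataFrame and returns the column. A minimal sketch with shortened example data, where `discounted` is a made-up variable name:

```python
import pandas as pd

foods = pd.DataFrame({'name': ['Bagel', 'Carrot'],
                      'price': [4.30, 0.70]})

# the lambda receives the dataframe and returns the values of the new column
discounted = foods.assign(student_price=lambda df: df['price'] * 0.9)

# the original dataframe is left untouched; only the returned copy has the new column
print('student_price' in foods.columns)
print(discounted)
```

Because the result is a new DataFrame, you either store it in a new variable (as here) or assign it back to `foods`.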

### Delete a column with ```.drop( )```

In [None]:
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20]
})

foods.drop('supplier',axis='columns',inplace=True)
foods

### Replace some data with ```.replace( )```

Note: **NaN** stands for "Not a Number" and is sort of like **None**
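
A short sketch of how NaN behaves, using a made-up Series of years with one missing value:

```python
import numpy as np
import pandas as pd

years = pd.Series([2018, np.nan, 2012])

# NaN is never equal to anything, not even to itself, so use isna() to find it
print(np.nan == np.nan)
print(years.isna().tolist())

# statistics such as mean() skip NaN by default
print(years.mean())
```

This is the main reason to replace placeholder values like 0 with NaN: a 0 would drag the mean down, while NaN is simply left out of the calculation.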

In [None]:
# Replace in ALL COLUMNS

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0]
})

# replace value 0 with NaN 
foods.replace(0, np.nan, inplace=True)
foods

In [None]:
# Replace in SELECTED COLUMNS

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0]
})

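# replace value 0 with NaN - ONLY IN THE SELECTED COLUMN
# assigning the result back is more reliable than inplace=True on a single column
foods['sold_since_year'] = foods['sold_since_year'].replace(0, np.nan)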
foods

In [None]:
# Replace many items with one replacement

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0]
})

foods.replace(['Vegetarian','Vegan'],'No Meat',inplace=True)
foods

In [None]:
# Replace many items with many replacements

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0]
})

foods.replace(['Vegetarian','Vegan', 'Meat'],['VEGE', 'VEGA', 'MEAT'],inplace=True)
foods

In [None]:
# Replace many items with many replacements. Use a dictionary.
# VERY USEFUL FOR LOOKUPS! eg. when you have a table that translates eg. id into name

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0],
    'delivery':['Monday', 'Saturday', 'Saturday', 'Tuesday', 'Sunday', 'Wednesday'],

})

foods.replace({'Saturday': 'Weekend', 'Sunday': 'Weekend'},inplace=True)
foods

### Simple statistics for all data and grouped by a value, using  groupby()

DataFrames provide a rich set of statistical methods. If you need something specific, always look in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [None]:
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0]
})

# mean of all prices
print( foods['price'].mean() )

In [None]:
# mean of all prices, grouped by diet
print( foods['price'].groupby(foods['diet']).mean() )

In [None]:
print( foods['price'].groupby(foods['diet']).max() )

In [None]:
print( foods['price'].groupby(foods['diet']).median() )
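
If you want several statistics per group in one table, one option is `.agg( )` with a list of statistic names. This is a sketch on a shortened version of the `foods` data; `summary` is a made-up variable name.

```python
import pandas as pd

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet': ['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'price': [4.30, 2.10, 0.7, 5.70, 3.20, 0]
})

# group rows by diet, pick the price column, then calculate three statistics per group
summary = foods.groupby('diet')['price'].agg(['mean', 'max', 'count'])
print(summary)
```

The result is itself a DataFrame, with one row per diet and one column per statistic.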

### Index - the most important part of your data (should be unique, but does not have to be)

If you do not specify the index in your data, pandas will just use consecutive numbers starting from 0 (like 0,1,2,3,4,...). Have a look at the dataframes you created before. The index is that number to the left. It's sort of like a row name in Excel.

```.set_index(a_column_name)``` will set a column with name a_column_name to be the index

```drop=False``` will make the old column stay (it will sort-of get duplicated and you'd have two identical columns: the original one, and the new index column)

You could also have many columns act as  indexes, but we will not go into that. If you wanted to do that, just pass a List of column names to set_index rather than one column name.

In [None]:
# first with no index. Notice the numbers on the left-hand side
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0]
})

foods

In [None]:
# turn one of the columns into an index
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0]
}).set_index('name')

foods

In [None]:
# ... and also keep that column as it was, with drop=False
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0],
    'sold_since_year': [2018, 2015, 0, 2012, 0, 0]
}).set_index('name', drop=False)

foods

In [None]:
# so now you can use your index to get whole rows from the dataframe
# this is a bit cleaner than indexes 1,2,3,4... depending on your data
foods.loc[['Bagel']]
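
With a meaningful index, `.loc` can also pick out a single cell or a small block by row name and column name. A sketch on shortened example data, with made-up variable names:

```python
import pandas as pd

foods = pd.DataFrame({
    'name': ['Bagel', 'Carrot', 'Egg Cake'],
    'diet': ['Vegan', 'Vegan', 'Vegetarian'],
    'price': [4.30, 0.7, 3.20]
}).set_index('name')

# one row label + one column name gives a single value
bagel_price = foods.loc['Bagel', 'price']
print(bagel_price)

# lists of row labels and column names give a smaller dataframe
some_rows = foods.loc[['Bagel', 'Egg Cake'], ['price']]
print(some_rows)
```

This only gives a single value when the index is unique; with repeated index values `.loc['Bagel', 'price']` would return several items.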

### Renaming columns and rows

You can rename rows, columns, or both. Just specify a dict where key is the OLD VALUE, and value is the NEW VALUE.

```{old_value: new_value, old_value_2: new_value_2}```


In [None]:
# here we change names of some rows. Basically it means that we change values in the index
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0]
}).set_index('name',  drop=False)

# rename NAMES of rows
foods.rename(index = {'Bagel':'BAGEL', 'Tap Water':'WATER'}, inplace=True)
foods

In [None]:
# here we change names of some columns.
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0]
}).set_index('name',  drop=False)

# rename NAMES of columns (not values)
foods.rename(columns={'supplier':'from', 'price':'pounds'},inplace=True)
foods

In [None]:
#You can also do both at the same time
# AND you can use string functions like str.upper, str.title, etc

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0]
}).set_index('name', drop=True)

foods.rename(index = str.upper, columns=str.title ,inplace=True)
foods

In [None]:
# And just like the str functions above, you can use your own functions,
# either defined beforehand or as 'on the spot' lambda functions,
# for mapping old values to new values

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', 'Vegetarian', 'Vegan', 'Meat', 'Vegetarian', 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', 'Water Tap'],
    'price':[4.30, 2.10, 0.7, 5.70, 3.20, 0]
}).set_index('name', drop=True)


# in example below we use lambda function for index, and pre-defined function for columns
# just so you see you can do both

def add_the(word):
    return f"The {word}"

foods.rename(index = (lambda name: name[0:3]), columns=add_the ,inplace=True)
foods

# Categories - Grouping results by a range of values. Use ```pd.cut( data, bins, labels )``` 

Often we want to categorise our data into particular groups by value. Given a set of values, we want to decide in which range they belong.

Imagine a bunch of student exam scores (70,54,40,66) that we want to translate into grades (A,B,C,D,F).

We will need a key showing where one grade ends and another starts. These ranges are often called bins (like buckets/containers), and our task is to put each score into one of these bins.

- F is (0, 40]
- D is (40, 50]
- C is (50, 60]
- B is (60, 70]
- A is (70, 100]

Note: 

- '(' means the boundary value is excluded from the bin
- ']' means the boundary value is included in the bin

In pandas you could describe it as ```[(0, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 100]]```
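
The bracket notation can be checked directly with `pd.Interval`, the object pandas uses to represent a single bin. This is only a sketch; `f_bin` and `d_bin` are made-up names for the F and D bins.

```python
import pandas as pd

f_bin = pd.Interval(0, 40, closed='right')    # written as (0, 40]
d_bin = pd.Interval(40, 50, closed='right')   # written as (40, 50]

# a square bracket includes the boundary, a round bracket excludes it
print(40 in f_bin)   # 40 sits on the closed ']' side of the F bin
print(40 in d_bin)   # 40 sits on the open '(' side of the D bin
print(0 in f_bin)    # 0 sits on the open '(' side of the F bin
```

So a score of exactly 40 lands in the F bin and not in the D bin, and every score belongs to exactly one bin.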

In [None]:
import pandas as pd
import numpy as np

student_scores = [40,42,46,54,60,63,66,70, 72]

bins = [0,40,50,60,70,100]
grades = ["F","D","C","B","A"]

# note: labels (eg. grades) are optional, but very useful

# cut will categorise
categories = pd.cut(student_scores, bins, labels=grades, right=False)
print(categories) # print shows how the values are categorised, and also what the categories mean

In [None]:
# if you want the rightmost elements to be included in the smaller category (eg. for score 40 to be an 'F')
# use right=True argument, or just no right argument (True is a default)
categories = pd.cut(student_scores, bins, labels=grades, right=True)
print(categories)
print()

When you use ```pd.cut(student_scores, bins, labels=["F","D","C","B","A"])``` the resulting object contains information about 

- which category each of your data points belongs to
- what the categories are, and their boundaries

In [None]:
student_scores = [40,54,60,66,70]
bins = [0,40,50,60,70,100]
labels = ["F","D","C","B","A"]
categories = pd.cut(student_scores, bins, labels=labels)

print(categories)
print()
print( categories.tolist() )
print(categories.codes) # in older versions this was called .labels

In [None]:
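# note: the older pd.value_counts(categories) is deprecated in recent pandas versions
print(pd.Series(categories).value_counts())

In [None]:
# there are times when a cumulative sum is very useful. It adds up all items so far, but the order is peculiar
print(pd.Series(categories).value_counts().cumsum())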

### AND FINALLY: Add a new column with bin values. Very useful for labeling

In [None]:
data = pd.DataFrame( {'student_scores': [40,54,60,66,70]} )
bins = [0,40,50,60,70,100]
labels = ["F","D","C","B","A"]

data['grade'] = pd.cut(data['student_scores'], bins=bins, labels=labels)
data

### Date ranges

In [None]:
dates = pd.date_range(start = '20200101', end = '20220101',periods=4)
dates

In [None]:
dates = pd.date_range(start = '20200101',freq='M',periods=4) # 'M' means month end; pandas 2.2 and later prefer the alias 'ME'
dates

### Quickly create some fake data:

In [None]:
# generate some fake data quickly: combine np.arange( ) with .reshape()

# grab 12 numbers
np.arange(12)

In [None]:
# fold them into a 2D array
np.arange(12).reshape((3, 4))

In [None]:
# and put some meaning to them
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Glasgow', 'Inverness', 'St Andrews'],
                    columns=['Jan', 'Feb','Mar','Apr'])
data

In [None]:
# this is even better with random numbers, eg
np.random.randint(low = 20, high = 28, size = 12)

In [None]:
raining_days = pd.DataFrame(np.random.randint(low = 20, high = 28, size = 12).reshape((3, 4)),
                    index=['Glasgow', 'Inverness', 'St Andrews'],
                    columns=['Jan', 'Feb','Mar','Apr'])
raining_days
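
Random data changes every time you run the cell. If you want the same "random" numbers on every run (handy when sharing a notebook), one option is a seeded generator. This is a sketch using NumPy's `default_rng`, swapped in for `np.random.randint`; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# a generator created with a fixed seed produces the same numbers on every run
rng = np.random.default_rng(seed=42)   # 42 is an arbitrary choice

# integers from 20 up to (but not including) 28, already shaped as 3 rows x 4 columns
values = rng.integers(low=20, high=28, size=(3, 4))

raining_days = pd.DataFrame(values,
                            index=['Glasgow', 'Inverness', 'St Andrews'],
                            columns=['Jan', 'Feb', 'Mar', 'Apr'])
print(raining_days)
```

Passing `size=(3, 4)` creates the 2D shape directly, so no separate `.reshape()` step is needed.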

## ⭐️⭐️⭐️💥 What you learned in this session: Three stars and a wish 
**In your own words** write in your Learn diary:

- 3 things you would like to remember from this badge
- 1 thing you wish to understand better in the future or a question you'd like to ask
