<a href="https://colab.research.google.com/github/AmberMynott/AmberMynott-DataScience-GenAI-Submissions/blob/main/Assignment_1/2_01_data_and_feature_engineering_in_pandas_COMPLETED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://drive.google.com/uc?export=view&id=1xqQczl0FG-qtNA2_WQYuWePW9oU8irqJ)

# Data and Feature Engineering
This Notebook will introduce us to preparing our datasets for data science and AI workloads. For structured data, this will often utilise popular Python libraries such as Pandas, which will be the focus for this Notebook.

### What is Pandas?
As Python has grown to become a popular solution for data analysis, many new tools have been introduced to support such tasks. Pandas is arguably the most popular and widely used of these, although it does have some limitations (if working with very large datasets you may want to look at Dask and/or PySpark).

First, we will import it into our session. This effectively means we have opened the library in the background, and can now use the [functions](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_10_functions.ipynb) built into it. We will also import numpy - another widely used Python package for working with numerical data (the name is a portmanteau of "numerical" and "python"). By convention we import pandas as "pd" and numpy as "np". This is equivalent to giving each library a short variable name.

In [3]:
import pandas as pd
import numpy as np

### Testing our installation
To test everything is working we will create some fake data using numpy and load it into a pandas dataframe (more on these below). First we will create the random numbers:

In [2]:
x = np.random.rand(10,1)
x

array([[0.6746131 ],
       [0.82767163],
       [0.16552357],
       [0.57545813],
       [0.05869384],
       [0.50323401],
       [0.64023316],
       [0.00243822],
       [0.51560888],
       [0.64651795]])

The commands here have told numpy to create a set of random numbers between zero and one. The arguments we have passed, "10" and "1", tell numpy we want a 10x1 array of numbers (i.e. a table, or more accurately a column vector, with 10 rows and 1 column).
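
As a quick check of the description above, the sketch below inspects the shape and value range of such an array. It uses NumPy's `default_rng` generator with an arbitrary seed of 42, which is an assumption made here only so the numbers are repeatable; the cell above uses `np.random.rand` without a seed.

```python
import numpy as np

# Seeded generator so the same numbers appear on every run (seed 42 is arbitrary)
rng = np.random.default_rng(42)

# 10 rows and 1 column of random floats in the half-open interval [0, 1)
sample = rng.random((10, 1))

print(sample.shape)                                 # (rows, columns)
print(sample.min() >= 0.0 and sample.max() < 1.0)   # all values lie in [0, 1)
```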

Next we will create a pandas dataframe using "x":

In [3]:
testdf = pd.DataFrame(x)
testdf

Unnamed: 0,0
0,0.674613
1,0.827672
2,0.165524
3,0.575458
4,0.058694
5,0.503234
6,0.640233
7,0.002438
8,0.515609
9,0.646518


We have now successfully created a pandas dataframe!
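
The column in the output above is simply labelled `0` because no name was supplied. A minimal sketch of how a label might be given at creation time follows; the name `random_value` is purely illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

values = np.random.rand(10, 1)

# `columns` takes one label per column; "random_value" is an illustrative name
named_df = pd.DataFrame(values, columns=["random_value"])

print(named_df.head(3))            # first three rows
print(named_df.columns.tolist())   # the column labels
```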

### What are Dataframes?
Pandas has a very elegant way of managing data, very much borrowed from the statistical language R, mostly based around dataframes. You can think of dataframes a bit like an Excel table with rows, columns and common operations like sum, average and so on. We can use them on top of our previous work on lists and dictionaries (etc.) to make that data more usable and malleable.
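
To make the Excel comparison concrete, here is a small sketch of the sum and average operations on a dataframe. The `sales` table and its numbers are made up for illustration only.

```python
import pandas as pd

# A small made-up table: each key becomes a column, each list its values
sales = pd.DataFrame({"units": [1, 2, 10, 2], "price": [30.0, 8.0, 0.5, 15.0]})

print(sales["units"].sum())    # total of one column
print(sales["price"].mean())   # average of one column
print(sales.sum())             # column-wise totals for the whole table
```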

To begin with we will create a dataset - in this case a [dictionary](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_05_dictionaries.ipynb).



In [15]:
orders = {'o10001':{'date':'2024/01/10', 'product':'Hoodie', 'quantity':'1'},
            'o10002':{'date':'2024/01/13', 'product':'Tote bag', 'quantity':'2'},
            'o10003':{'date':'2024/01/14', 'product':'Pencil', 'quantity':'10'},
            'o10004':{'date':'2024/01/15', 'product':'T-shirt', 'quantity':'2'}
}
orders

{'o10001': {'date': '2024/01/10', 'product': 'Hoodie', 'quantity': '1'},
 'o10002': {'date': '2024/01/13', 'product': 'Tote bag', 'quantity': '2'},
 'o10003': {'date': '2024/01/14', 'product': 'Pencil', 'quantity': '10'},
 'o10004': {'date': '2024/01/15', 'product': 'T-shirt', 'quantity': '2'}}

We can convert this to a dataframe with great ease:

In [5]:
import pandas as pd
import numpy as np

orders_df = pd.DataFrame(orders)
orders_df

Unnamed: 0,o10001,o10002,o10003,o10004
date,2024/01/10,2024/01/13,2024/01/14,2024/01/15
product,Hoodie,Tote bag,Pencil,T-shirt
quantity,1,2,10,2
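
In the output above each order appears as a column. If one row per order is preferred, one possible approach (a sketch, not the only way) is `pd.DataFrame.from_dict` with `orient="index"`. Because the quantities were entered as strings, the sketch also converts them to integers before summing; the name `orders_by_row` is introduced here for illustration.

```python
import pandas as pd

orders = {'o10001': {'date': '2024/01/10', 'product': 'Hoodie', 'quantity': '1'},
          'o10002': {'date': '2024/01/13', 'product': 'Tote bag', 'quantity': '2'},
          'o10003': {'date': '2024/01/14', 'product': 'Pencil', 'quantity': '10'},
          'o10004': {'date': '2024/01/15', 'product': 'T-shirt', 'quantity': '2'}}

# orient="index" makes each outer key (the order ID) a row rather than a column
orders_by_row = pd.DataFrame.from_dict(orders, orient="index")

# Quantities were entered as text, so convert them before doing arithmetic
orders_by_row["quantity"] = orders_by_row["quantity"].astype(int)

print(orders_by_row)
print(orders_by_row["quantity"].sum())   # total items ordered
```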


We can even create such outputs from more complex dictionaries, such as a dictionary which includes a nested list:

In [6]:
customers = {'Mark':{'name':'Mark Johnson', 'open_orders':3, 'orders':['o10001', 'o10002', 'o10004']},
             'Katy':{'name':'Katy Hoad', 'open_orders':0, 'orders':[]},
             'Angela':{'name':'Angela Lorenz', 'open_orders':1, 'orders':['o10003']},
             'Bo':{'name':'Bo Kelestyn', 'open_orders':0, 'orders':[]}
}
customers

{'Mark': {'name': 'Mark Johnson',
  'open_orders': 3,
  'orders': ['o10001', 'o10002', 'o10004']},
 'Katy': {'name': 'Katy Hoad', 'open_orders': 0, 'orders': []},
 'Angela': {'name': 'Angela Lorenz', 'open_orders': 1, 'orders': ['o10003']},
 'Bo': {'name': 'Bo Kelestyn', 'open_orders': 0, 'orders': []}}

In [7]:
customers_df = pd.DataFrame(customers)
customers_df

Unnamed: 0,Mark,Katy,Angela,Bo
name,Mark Johnson,Katy Hoad,Angela Lorenz,Bo Kelestyn
open_orders,3,0,1,0
orders,"[o10001, o10002, o10004]",[],[o10003],[]
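
The nested lists are stored as whole objects inside single cells. One way to work with them, sketched below under the assumption that one row per order is wanted, is to transpose the table and then call `explode`. The names `customers_by_row` and `one_order_per_row` are introduced here for illustration.

```python
import pandas as pd

customers = {'Mark': {'name': 'Mark Johnson', 'open_orders': 3, 'orders': ['o10001', 'o10002', 'o10004']},
             'Katy': {'name': 'Katy Hoad', 'open_orders': 0, 'orders': []},
             'Angela': {'name': 'Angela Lorenz', 'open_orders': 1, 'orders': ['o10003']},
             'Bo': {'name': 'Bo Kelestyn', 'open_orders': 0, 'orders': []}}

# Transpose so that each customer is a row and 'orders' is a column of lists
customers_by_row = pd.DataFrame(customers).T

# explode() gives each list element its own row; an empty list becomes one missing value
one_order_per_row = customers_by_row.explode("orders")

print(one_order_per_row[["name", "orders"]])
```

Customers with no orders still appear once, with a missing value in the `orders` column, so no customer is silently dropped.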


Pandas dataframes can also be built from various other data structures and sources (including Excel files, CSVs, text files, databases and many more). For example, we can make our dataframe from a set of [lists](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_04_lists.ipynb):

In [8]:
a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

listdf = pd.DataFrame([a, b, c])
listdf

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,a,b,c,d
2,True,False,True,False
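
Note that each list became a row in the output above. If the lists are meant to be columns instead, a common alternative is to pass a dictionary of lists; the column names `a`, `b` and `c` below are chosen here only to mirror the variable names.

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

# Each key becomes a column name and each list supplies that column's values
columns_df = pd.DataFrame({"a": a, "b": b, "c": c})

print(columns_df)
print(columns_df.dtypes)   # each column keeps a single data type
```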


### EXERCISE
Try building your own dataframes from a list and/or dictionary you create. What would happen if you have an item missing from one element? E.g. if "c" in the above example only had three items - True, False, True - rather than four. Test it - does the output match your expectation?

In [10]:
a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True]

listdf = pd.DataFrame([a, b, c])
listdf

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,a,b,c,d
2,True,False,True,


When the code above is repeated with only three elements in c instead of four, the last cell of the last row in the dataframe holds the missing value "None", since there is no fourth element to put in that place.
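
A short sketch of how the padded cell might be detected with `isna`, and of what happens when the same uneven lists are passed as columns rather than rows. The variable names `rows_df` and `raised` are introduced here for illustration.

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True]   # one item short

# Lists passed as rows: the short row is padded with a missing value
rows_df = pd.DataFrame([a, b, c])
print(rows_df.isna())               # True marks the padded cell
print(rows_df.isna().sum().sum())   # total number of missing cells

# Lists passed as columns: pandas refuses lists of unequal length
try:
    pd.DataFrame({"a": a, "c": c})
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

So the behaviour depends on how the lists are supplied: rows are padded, whereas columns of differing lengths raise an error.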

In [18]:
#Creating a data frame from a dictionary for students and their module marks

student_marks = {'6359802': {'Course': 'Mathematics', 'Year': "First", 'Module mark': '71'},
                 '5908223': {'Course': 'Philosophy', 'Year': "First", 'Module mark': '55'},
                 '6635109': {'Course': 'History', 'Year': "Second", 'Module mark': '59'},
                 '4990887': {'Course': 'Chemistry', 'Year': "Third", 'Module mark': '87'},
                 '7823431': {'Course': 'Biology', 'Year': "Second", 'Module mark': '68'}
}

student_marks_df = pd.DataFrame(student_marks)
student_marks_df

#Swapping the rows and columns:
student_marks_df.transpose()

Unnamed: 0,Course,Year,Module mark
6359802,Mathematics,First,71
5908223,Philosophy,First,55
6635109,History,Second,59
4990887,Chemistry,Third,87
7823431,Biology,Second,68


In [5]:
#Using a dictionary that includes a nested list

modules = {'IB2AD': {'name': 'Data science and gen AI', 'num_students': '3', 'student_IDs': ['6359802', '7823431', '4990887' ]},
           'ST234': {'name': 'Games and decisions', 'num_students': '2', 'student_IDs': ['6359802', '5908223']},
           'MA241': {'name': 'Combinatorics', 'num_students': '3', 'student_IDs': ['6359802', '6635109','4990887' ]}}

modules_df = pd.DataFrame(modules)
modules_df

Unnamed: 0,IB2AD,ST234,MA241
name,Data science and gen AI,Games and decisions,Combinatorics
num_students,3,2,3
student_IDs,"[6359802, 7823431, 4990887]","[6359802, 5908223]","[6359802, 6635109, 4990887]"


In [4]:
#Creating a data frame from a NumPy array of random numbers

numbers = np.random.rand(10, 2) #Generating 2 columns of 10 numbers between 0 and 1
numbers_df = pd.DataFrame(numbers)
numbers_df

Unnamed: 0,0,1
0,0.107942,0.815904
1,0.578558,0.370048
2,0.098089,0.04516
3,0.268272,0.208639
4,0.192197,0.025071
5,0.094099,0.827695
6,0.797671,0.003495
7,0.296191,0.540728
8,0.774223,0.629331
9,0.306644,0.381402
