<a href="https://colab.research.google.com/github/Cullen-hub/IB2AD0_Data_Science_GenerativeAI/blob/main/2_01_data_and_feature_engineering_in_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://drive.google.com/uc?export=view&id=1xqQczl0FG-qtNA2_WQYuWePW9oU8irqJ)

# Data and Feature Engineering
This Notebook will introduce us to preparing our datasets for data science and AI workloads. For structured data, this will often rely on popular Python libraries such as Pandas, which will be the focus of this Notebook.

### What is Pandas?
As Python has grown to become a popular solution for data analysis, many new tools have been introduced to support such tasks. Pandas is arguably the most popular and widely used of these, although it does have some limitations (if working with very large datasets you may want to look at Dask and/or PySpark).

First, we will import it into our session. This effectively means we have opened the library in the background, and can now use the [functions](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_10_functions.ipynb) built into it. We will also import numpy - another widely used Python package for working with numerical data (the name is a portmanteau of "numerical" and "python"). By convention we import pandas as "pd" and numpy as "np". This is equivalent to giving it a variable name.

In [None]:
import pandas as pd
import numpy as np

### Testing our installation
To test that everything is working, we will create some fake data using numpy and load it into a pandas dataframe (more on these below). First we will create the random numbers:

In [None]:
x = np.random.rand(10,1)
x

The command here has told numpy to create a set of random numbers between zero and one. The arguments we have passed, "10" and "1", tell numpy we want a 10x1 array of numbers (i.e. a table, or more accurately a column vector, with 10 rows and 1 column).
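As a small sketch of how the two arguments control the shape, the cell below builds three arrays and checks their `shape` attribute. The variable names `column`, `row` and `flat` are made up for illustration and are not used elsewhere in this Notebook.

```python
import numpy as np

# The arguments to np.random.rand set the shape: rows first, then columns
column = np.random.rand(10, 1)   # 10 rows, 1 column
row = np.random.rand(1, 10)      # 1 row, 10 columns
flat = np.random.rand(10)        # one-dimensional array of 10 values

print(column.shape)  # (10, 1)
print(row.shape)     # (1, 10)
print(flat.shape)    # (10,)
```

The values themselves change on every run, but the shapes do not.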

Next we will create a pandas dataframe using "x":

In [None]:
testdf = pd.DataFrame(x)
testdf

We have now successfully created a pandas dataframe!

### What are Dataframes?
Pandas has a very elegant way of managing data, very much borrowed from the statistical language R and mostly based around dataframes. You can think of a dataframe as a bit like an Excel table, with rows, columns and common operations like sum, average and so on. We can use dataframes on top of our previous work on lists and dictionaries (etc.) to make that data more usable and malleable.
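To make the Excel comparison concrete, here is a minimal sketch of column-wise sum and average. The `sales` table and its numbers are invented purely for illustration.

```python
import pandas as pd

# Hypothetical table: the column names and values are made up for this example
sales = pd.DataFrame({'units': [1, 2, 10, 2],
                      'price': [30.0, 8.0, 0.5, 12.0]})

total_units = sales['units'].sum()      # adds up the 'units' column
average_price = sales['price'].mean()   # averages the 'price' column

print(total_units)    # 15
print(average_price)  # 12.625
```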

To begin with we will create a dataset - in this case a [dictionary](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_05_dictionaries.ipynb).



In [None]:
orders = {'o10001':{'date':'2024/01/10', 'product':'Hoodie', 'quantity':'1'},
            'o10002':{'date':'2024/01/13', 'product':'Tote bag', 'quantity':'2'},
            'o10003':{'date':'2024/01/14', 'product':'Pencil', 'quantity':'10'},
            'o10004':{'date':'2024/01/15', 'product':'T-shirt', 'quantity':'2'}
}
orders

We can convert this to a dataframe with great ease:

In [None]:
import pandas as pd
import numpy as np

orders_df = pd.DataFrame(orders)
orders_df
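Note that with a dictionary of dictionaries, the outer keys (the order IDs) become the columns. If one row per order is preferred, one option is `pd.DataFrame.from_dict` with `orient='index'`. The sketch below uses a shortened copy of the orders data; the names `by_column` and `by_row` are illustrative only.

```python
import pandas as pd

# Shortened copy of the orders dictionary from above
orders = {'o10001': {'date': '2024/01/10', 'product': 'Hoodie', 'quantity': '1'},
          'o10002': {'date': '2024/01/13', 'product': 'Tote bag', 'quantity': '2'}}

by_column = pd.DataFrame(orders)                          # order IDs as columns
by_row = pd.DataFrame.from_dict(orders, orient='index')   # order IDs as rows

# The quantities were stored as strings, so convert them before doing arithmetic
by_row['quantity'] = by_row['quantity'].astype(int)

print(by_column.shape)           # (3, 2)
print(by_row.shape)              # (2, 3)
print(by_row['quantity'].sum())  # 3
```

Calling `.transpose()` on the first dataframe gives the same layout as `by_row`.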

We can even create such outputs from more complex dictionaries, such as a dictionary which includes a nested list:

In [None]:
customers = {'Mark':{'name':'Mark Johnson', 'open_orders':3, 'orders':['o10001', 'o10002', 'o10004']},
             'Katy':{'name':'Katy Hoad', 'open_orders':0, 'orders':[]},
             'Angela':{'name':'Angela Lorenz', 'open_orders':1, 'orders':['o10003']},
             'Bo':{'name':'Bo Kelestyn', 'open_orders':0, 'orders':[]}
}
customers

In [None]:
customers_df = pd.DataFrame(customers)
customers_df
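The nested lists are stored as ordinary Python objects inside the cells. As a hedged sketch of how they might be worked with, the cell below transposes a shortened copy of the customers data, counts the orders per customer, and uses `explode` to give each order its own row. The names `customers_by_row`, `order_counts` and `one_order_per_row` are illustrative.

```python
import pandas as pd

# Shortened copy of the customers dictionary from above
customers = {'Mark': {'name': 'Mark Johnson', 'open_orders': 3,
                      'orders': ['o10001', 'o10002', 'o10004']},
             'Katy': {'name': 'Katy Hoad', 'open_orders': 0, 'orders': []}}

# One row per customer
customers_by_row = pd.DataFrame(customers).transpose()

# Length of each nested list
order_counts = customers_by_row['orders'].apply(len)

# One row per order; a customer with an empty list keeps a single row with a missing value
one_order_per_row = customers_by_row.explode('orders')

print(order_counts)
print(one_order_per_row)
```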

Pandas dataframes can also be built from various other data structures and sources (including Excel files, CSVs, text files, databases and many more). For example, we can make our dataframe from a set of [lists](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_04_lists.ipynb):

In [None]:
a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

listdf = pd.DataFrame([a, b, c])
listdf
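Passing a list of lists treats each list as a row, so each resulting column mixes numbers, strings and booleans. If each list should instead be a column, one option is to pass a dictionary that maps column names to lists. In this sketch the column names simply reuse the variable names.

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

as_rows = pd.DataFrame([a, b, c])                    # each list is a row
as_columns = pd.DataFrame({'a': a, 'b': b, 'c': c})  # each list is a column

print(as_rows.shape)      # (3, 4)
print(as_columns.shape)   # (4, 3)
print(as_columns.dtypes)  # each column keeps a single data type
```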

### EXERCISE
Try building your own dataframes from a list and/or dictionary you create. What would happen if an item is missing from one element? E.g. if "c" in the above example only had three items (True, False, True) rather than four. Test it: does the output match your expectation?

In [41]:
import pandas as pd
import numpy as np

# Record 1 deliberately has no 'Gender' key, to see how pandas handles a missing item
Dict = {1: {'Name': 'Ben', 'Animal': 'Dog', 'Age': '19'},
        2: {'Name': 'Bob', 'Animal': 'Cat', 'Age': '20', 'Gender': 'Male'},
        3: {'Name': 'Sam', 'Animal': 'Dog', 'Age': '18', 'Gender': 'Female'},
        4: {'Name': 'Tom', 'Animal': 'Cat', 'Age': '17', 'Gender': 'Male'},
        5: {'Name': 'Tim', 'Animal': 'Dog', 'Age': '16', 'Gender': 'Female'}
}

In [42]:
df = pd.DataFrame(Dict)
df = df.transpose()
df

,Name,Animal,Age,Gender
1,Ben,Dog,19,
2,Bob,Cat,20,Male
3,Sam,Dog,18,Female
4,Tom,Cat,17,Male
5,Tim,Dog,16,Female
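The empty Gender cell for Ben in the table above is how a missing dictionary key shows up: pandas fills the gap with a missing value (`NaN`). A minimal sketch with a two-record version of the data, where `records` and `people` are illustrative names:

```python
import pandas as pd

# Record 1 has no 'Gender' key; record 2 does
records = {1: {'Name': 'Ben', 'Animal': 'Dog'},
           2: {'Name': 'Bob', 'Animal': 'Cat', 'Gender': 'Male'}}

people = pd.DataFrame(records).transpose()

# isna() flags the cells pandas filled in as missing
print(people['Gender'].isna())
```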


In [34]:
df.reset_index(inplace = True)
df

,index,Name,Animal,Age,Gender
0,1,Ben,Dog,19,Male
1,2,Bob,Cat,20,Male
2,3,Sam,Dog,18,Female
3,4,Tom,Cat,17,Male
4,5,Tim,Dog,16,Female


In [36]:
# Rename the column created by reset_index from 'index' to 'id'
df.rename(columns = {'index':'id'}, inplace = True)
df

,id,Name,Animal,Age,Gender
0,1,Ben,Dog,19,Male
1,2,Bob,Cat,20,Male
2,3,Sam,Dog,18,Female
3,4,Tom,Cat,17,Male
4,5,Tim,Dog,16,Female


In [37]:
print(df.columns)

Index(['id', 'Name', 'Animal', 'Age', 'Gender'], dtype='object')


In [39]:
df.set_index('id')

id,Name,Animal,Age,Gender
1,Ben,Dog,19,Male
2,Bob,Cat,20,Male
3,Sam,Dog,18,Female
4,Tom,Cat,17,Male
5,Tim,Dog,16,Female
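One thing worth noting about the last cell: without `inplace = True`, `set_index` returns a new dataframe and leaves the original unchanged, so the result needs to be assigned to a name to be kept. A minimal sketch, using a made-up two-row table called `pets`:

```python
import pandas as pd

# Hypothetical two-row table for illustration
pets = pd.DataFrame({'id': [1, 2], 'Name': ['Ben', 'Bob']})

indexed = pets.set_index('id')   # returns a new dataframe

print('id' in pets.columns)      # True: the original still has 'id' as a column
print(indexed.index.name)        # id
print(indexed.loc[2, 'Name'])    # Bob
```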
