<a href="https://colab.research.google.com/github/GonzaloGSL06/GonzaloGSL-DataScience-GenAI-Submissions/blob/main/Assignment_1/2_01_data_and_feature_engineering_in_pandas_COMPLETED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://drive.google.com/uc?export=view&id=1xqQczl0FG-qtNA2_WQYuWePW9oU8irqJ)

# Data and Feature Engineering
This Notebook will introduce us to preparing our datasets for data science and AI workloads. For structured data, this will often utilise popular Python libraries such as Pandas, which will be the focus for this Notebook.

### What is Pandas?
As Python has grown to become a popular solution for data analysis, many new tools have been introduced to support such tasks. Pandas is arguably the most popular and widely used of these, although it does have some limitations (if working with very large datasets you may want to look at Dask and/or PySpark).

First, we will import it into our session. This effectively means we have opened the library in the background, and can now use the [functions](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_10_functions.ipynb) built into it. We will also import numpy - another widely used Python package for working with numerical data (the name is a portmanteau of "numerical" and "python"). By convention we import pandas as "pd" and numpy as "np". This is equivalent to giving each library a shorter variable name.

In [None]:
import pandas as pd
import numpy as np

### Testing our installation
To test that everything is working, we will create some fake data using numpy and load it into a pandas dataframe (more on these below). First we will create the random numbers:

In [None]:
x = np.random.rand(10,1)
x

array([[0.4230721 ],
       [0.7993314 ],
       [0.98077692],
       [0.63997171],
       [0.47400983],
       [0.50588494],
       [0.23967347],
       [0.58784005],
       [0.24186738],
       [0.54888849]])

The commands here have told numpy to create a set of random numbers between zero and one. The arguments we have passed, "10" and "1", tell numpy we want a 10x1 array of numbers (i.e. a table, or more accurately a column vector, with 10 rows and 1 column).
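To make the effect of the two arguments concrete, here is a minimal sketch comparing a few shapes. The variable names `tall`, `wide` and `flat` are illustrative only and are not used elsewhere in this Notebook.

```python
import numpy as np

# The first argument is the number of rows, the second the number of columns
tall = np.random.rand(10, 1)   # 10 rows, 1 column
wide = np.random.rand(1, 10)   # 1 row, 10 columns
flat = np.random.rand(10)      # a one-dimensional array of 10 values

print(tall.shape)  # (10, 1)
print(wide.shape)  # (1, 10)
print(flat.shape)  # (10,)

# Every value lies in the half-open interval [0, 1)
print(tall.min() >= 0 and tall.max() < 1)  # True
```

Swapping the arguments changes the layout, not the number of values generated.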

Next we will create a pandas dataframe using "x":

In [None]:
testdf = pd.DataFrame(x)
testdf

Unnamed: 0,0
0,0.423072
1,0.799331
2,0.980777
3,0.639972
4,0.47401
5,0.505885
6,0.239673
7,0.58784
8,0.241867
9,0.548888


We have now successfully created a pandas dataframe!

### What are Dataframes?
Pandas has a very elegant way of managing data, very much borrowed from the statistical language R, mostly based around dataframes. You can think of dataframes a bit like an Excel table with rows, columns and common operations like sum, average and so on. We can use dataframes to make our previous work on lists and dictionaries (etc.) more usable and malleable.
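As a rough illustration of the spreadsheet analogy, the sketch below builds a small table and applies column-wise operations to it. The `sales` data and its column names are invented for this example and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical sales figures, invented purely for illustration
sales = pd.DataFrame({
    "product": ["Hoodie", "Tote bag", "Pencil"],
    "units": [3, 5, 10],
    "unit_price": [25.0, 8.0, 0.5],
})

# A new column computed from two others, like a formula filled down a spreadsheet
sales["revenue"] = sales["units"] * sales["unit_price"]

print(sales["units"].sum())      # 18
print(sales["revenue"].sum())    # 120.0
print(sales["revenue"].mean())   # 40.0
```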

To begin with we will create a dataset - in this case a [dictionary](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_05_dictionaries.ipynb).



In [None]:
orders = {'o10001':{'date':'2024/01/10', 'product':'Hoodie', 'quantity':'1'},
            'o10002':{'date':'2024/01/13', 'product':'Tote bag', 'quantity':'2'},
            'o10003':{'date':'2024/01/14', 'product':'Pencil', 'quantity':'10'},
            'o10004':{'date':'2024/01/15', 'product':'T-shirt', 'quantity':'2'}
}
orders

{'o10001': {'date': '2024/01/10', 'product': 'Hoodie', 'quantity': '1'},
 'o10002': {'date': '2024/01/13', 'product': 'Tote bag', 'quantity': '2'},
 'o10003': {'date': '2024/01/14', 'product': 'Pencil', 'quantity': '10'},
 'o10004': {'date': '2024/01/15', 'product': 'T-shirt', 'quantity': '2'}}

We can convert this to a dataframe with great ease:

In [None]:
import pandas as pd
import numpy as np

orders_df = pd.DataFrame(orders)
orders_df

Unnamed: 0,o10001,o10002,o10003,o10004
date,2024/01/10,2024/01/13,2024/01/14,2024/01/15
product,Hoodie,Tote bag,Pencil,T-shirt
quantity,1,2,10,2
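Note that the outer dictionary keys (the order IDs) have become columns, and every value is still a string. If one row per order is preferred, one possible approach (a sketch, not the only option) is `pd.DataFrame.from_dict` with `orient='index'`, followed by type conversion. The name `orders_by_row` is illustrative.

```python
import pandas as pd

orders = {
    'o10001': {'date': '2024/01/10', 'product': 'Hoodie', 'quantity': '1'},
    'o10002': {'date': '2024/01/13', 'product': 'Tote bag', 'quantity': '2'},
    'o10003': {'date': '2024/01/14', 'product': 'Pencil', 'quantity': '10'},
    'o10004': {'date': '2024/01/15', 'product': 'T-shirt', 'quantity': '2'},
}

# orient='index' turns the outer keys (order IDs) into row labels
orders_by_row = pd.DataFrame.from_dict(orders, orient='index')

# The source values are strings, so convert them before doing arithmetic
orders_by_row['quantity'] = orders_by_row['quantity'].astype(int)
orders_by_row['date'] = pd.to_datetime(orders_by_row['date'], format='%Y/%m/%d')

print(orders_by_row.shape)              # (4, 3)
print(orders_by_row['quantity'].sum())  # 15
```

The same layout can also be reached by transposing the original dataframe with `.T`.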


We can even create such outputs from more complex dictionaries, such as a dictionary which includes a nested list:

In [None]:
customers = {'Mark':{'name':'Mark Johnson', 'open_orders':3, 'orders':['o10001', 'o10002', 'o10004']},
             'Katy':{'name':'Katy Hoad', 'open_orders':0, 'orders':[]},
             'Angela':{'name':'Angela Lorenz', 'open_orders':1, 'orders':['o10003']},
             'Bo':{'name':'Bo Kelestyn', 'open_orders':0, 'orders':[]}
}
customers

{'Mark': {'name': 'Mark Johnson',
  'open_orders': 3,
  'orders': ['o10001', 'o10002', 'o10004']},
 'Katy': {'name': 'Katy Hoad', 'open_orders': 0, 'orders': []},
 'Angela': {'name': 'Angela Lorenz', 'open_orders': 1, 'orders': ['o10003']},
 'Bo': {'name': 'Bo Kelestyn', 'open_orders': 0, 'orders': []}}

In [None]:
customers_df = pd.DataFrame(customers)
customers_df
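In the resulting dataframe each cell in the `orders` row holds a whole Python list. The sketch below shows one way to inspect this and, if needed, unpack the lists into one row per order using the pandas `explode` method. The names `customers_by_row` and `exploded` are illustrative.

```python
import pandas as pd

customers = {
    'Mark': {'name': 'Mark Johnson', 'open_orders': 3, 'orders': ['o10001', 'o10002', 'o10004']},
    'Katy': {'name': 'Katy Hoad', 'open_orders': 0, 'orders': []},
    'Angela': {'name': 'Angela Lorenz', 'open_orders': 1, 'orders': ['o10003']},
    'Bo': {'name': 'Bo Kelestyn', 'open_orders': 0, 'orders': []},
}

# Transpose so that each customer is a row and 'orders' is a column
customers_by_row = pd.DataFrame(customers).T

# Each cell in the 'orders' column is still an ordinary Python list
print(type(customers_by_row.loc['Mark', 'orders']))  # <class 'list'>

# explode() creates one row per list item; empty lists become a single NaN row
exploded = customers_by_row.explode('orders')
print(len(exploded))                   # 6
print(exploded['orders'].isna().sum()) # 2
```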

Pandas dataframes can also be built from various other data structures and sources (including Excel files, CSVs, text files, databases and many more). For example, we can make our dataframe from a set of [lists](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_04_lists.ipynb):

In [None]:
a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

listdf = pd.DataFrame([a, b, c])
listdf

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,a,b,c,d
2,True,False,True,False
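Passing a list of lists treats each list as a row, so every column here ends up mixing numbers, strings and booleans. If each list is meant to be a column instead, one option is to pass a dictionary of lists. This is a minimal sketch, and the column labels "a", "b" and "c" are simply reused from the variable names.

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

# A dict of lists makes each list a named column instead of a row
columns_df = pd.DataFrame({"a": a, "b": b, "c": c})

print(columns_df.shape)        # (4, 3)
# Each column keeps a single, consistent data type
print(columns_df["a"].dtype)   # int64
print(columns_df["c"].dtype)   # bool
```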


### EXERCISE
Try building your own dataframes from a list and/or dictionary you create. What would happen if an item were missing from one element? E.g. if "c" in the above example only had three items - True, False, True - rather than four. Test it - does the output match your expectation?



In [None]:
# Exercise Solution: Building DataFrames and Testing Missing Data Behavior

import pandas as pd
import numpy as np

print("=" * 60)
print("PART 1: Creating DataFrames from Different Data Structures")
print("=" * 60)

In [None]:
# 1.1: DataFrame from NumPy array
print("\n1.1 Creating DataFrame from NumPy Random Array")
print("-" * 50)

# Generate random array with 15 rows and 1 column
x = np.random.rand(15, 1)
print(f"Generated random array with shape: {x.shape}")
print("\nFirst 5 values:")
print(x[:5])

In [None]:
# Convert to DataFrame with custom column name
testdf = pd.DataFrame(x, columns=['Random_Values'])
print("\nDataFrame created successfully:")
print(f"Shape: {testdf.shape}")
print(f"\nStatistics:\n{testdf.describe()}")
print(f"\nFirst 5 rows:")
testdf.head()

In [None]:
# 1.2: DataFrame from Dictionary
print("\n1.2 Creating DataFrame from Dictionary (Real Madrid Top Scorers)")
print("-" * 50)

Top_GA_Real_Madrid_so_far = {
    'Mbappé': {'Goals': 9, 'Assists': 2, 'Total': 11, 'Minutes_Played': 1234},
    'Vinicius Jr': {'Goals': 5, 'Assists': 4, 'Total': 9, 'Minutes_Played': 1156},
    'Arda Guler': {'Goals': 3, 'Assists': 3, 'Total': 6, 'Minutes_Played': 847}
}

# Create DataFrame and transpose for better readability
Top_GA_Real_Madrid_so_far_df = pd.DataFrame(Top_GA_Real_Madrid_so_far).T
Top_GA_Real_Madrid_so_far_df.index.name = 'Player'

print("\nDataFrame created:")
print(Top_GA_Real_Madrid_so_far_df)

# Add calculated field: Goals per 90 minutes
Top_GA_Real_Madrid_so_far_df['Goals_per_90'] = (
    Top_GA_Real_Madrid_so_far_df['Goals'] / 
    Top_GA_Real_Madrid_so_far_df['Minutes_Played'] * 90
).round(2)

print("\nWith calculated field (Goals per 90 minutes):")
Top_GA_Real_Madrid_so_far_df

In [None]:
Top_GA_Real_Madrid_so_far_df = pd.DataFrame(Top_GA_Real_Madrid_so_far)
Top_GA_Real_Madrid_so_far_df

Unnamed: 0,Mbappé,Vinicius Jr,Arda Guler
Goals,9,5,3
Assists,2,4,3
Total,11,9,6
Minutes_Played,1234,1156,847


In [None]:
print("\n" + "=" * 60)
print("PART 2: Testing Missing Data Behavior")
print("=" * 60)

# 2.1: Lists with equal lengths (WORKS)
print("\n2.1 Lists with EQUAL lengths (Expected: Success)")
print("-" * 50)
a = [1, 2, 3, 4]
b = [0, 1, 3, 4]
c = [True, False, True, False]

try:
    listdf_equal = pd.DataFrame([a, b, c], index=['Row_A', 'Row_B', 'Row_C'])
    print("✓ DataFrame created successfully!")
    print(listdf_equal)
except Exception as e:
    print(f"✗ Error: {e}")

# 2.2: Lists with DIFFERENT lengths (WORKS - pads with NaN)
print("\n2.2 Lists with DIFFERENT lengths (Expected: NaN padding)")
print("-" * 50)
a = [1, 2, 3, 4]
b = [0, 1, 3, 4]
c = [True, False, True]  # Only 3 items instead of 4

try:
    listdf_unequal = pd.DataFrame([a, b, c])
    print("✓ DataFrame created successfully!")
    print(listdf_unequal)
    print("\n   Note: pandas pads the shorter row with NaN instead of raising an error")
except Exception as e:
    # Not expected for a list of lists, but kept so any failure is reported clearly
    print(f"✗ Error occurred: {type(e).__name__}")
    print(f"   Message: {e}")

# 2.3: Dictionary with missing values (WORKS - fills with NaN)
print("\n2.3 Dictionary with MISSING values (Expected: NaN fill)")
print("-" * 50)
players_incomplete = {
    'Player_1': {'Goals': 10, 'Assists': 5, 'Team': 'Real Madrid'},
    'Player_2': {'Goals': 8, 'Team': 'Barcelona'},  # Missing 'Assists'
    'Player_3': {'Goals': 7, 'Assists': 3}  # Missing 'Team'
}

try:
    players_df = pd.DataFrame(players_incomplete).T
    print("✓ DataFrame created successfully!")
    print(players_df)
    print(f"\n   Note: Missing values are filled with NaN (Not a Number)")
    print(f"   Total NaN values: {players_df.isna().sum().sum()}")
except Exception as e:
    print(f"✗ Error: {e}")

# 2.4: Handling NaN values
print("\n2.4 Handling NaN values")
print("-" * 50)
print("Option A - Fill with default value:")
print(players_df.fillna('Unknown'))

print("\nOption B - Drop rows with NaN:")
print(players_df.dropna())

print("\nOption C - Check for NaN:")
print(players_df.isna())
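Filling every gap with the same value is not always appropriate, since a missing number and a missing label usually need different defaults. The sketch below shows one possible refinement: passing a dictionary to `fillna` to set a default per column, and using `dropna` with `subset` to drop rows only when a chosen column is missing. `players_demo` is a small stand-in rebuilt from the values above, and the chosen defaults are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the incomplete player data, with explicit missing values
players_demo = pd.DataFrame(
    {
        'Goals': [10, 8, 7],
        'Assists': [5, np.nan, 3],
        'Team': ['Real Madrid', 'Barcelona', None],
    },
    index=['Player_1', 'Player_2', 'Player_3'],
)

# A different default for each column (the defaults here are assumptions)
filled = players_demo.fillna({'Assists': 0, 'Team': 'Unknown'})
print(filled.isna().sum().sum())  # 0

# Drop a row only if 'Team' is missing; a missing 'Assists' value is tolerated
with_team = players_demo.dropna(subset=['Team'])
print(len(with_team))  # 2
```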