<a href="https://colab.research.google.com/github/Prajaktahz/Uni_Colab_Work/blob/main/FBA_Week_06_Python_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://www.cs.nott.ac.uk/~pszgss/teaching/nlab.png)
# FBA Computing Session Week 6:

**More Pandas**

The aim of this tutorial is to cover additional methods available within Pandas that could be useful when preprocessing the data.

We'll explore
- how to deal with missing (NaN) values;
- how encoding and decoding is performed;
- how to group and aggregate the data; and
- how to change format from wide to long and back.






**Import Libraries**

In [2]:
import pandas as pd
import numpy as np

**Step A1: How to deal with NaNs in NumPy?**

In [3]:
a = [[1, 2, np.nan], [4, 5, 6]]
# np.mean returns nan as soon as any element is NaN
np.mean(a)
# Compare: [[1, 2, 3], [4, 5, 6]]      -> 3.5
#          [[1, 2, np.nan], [4, 5, 6]] -> nan

nan

In [12]:
# Is there a method that can help?
new_a = np.array(a)
# Keep only the finite entries; note that boolean masking flattens the array to 1-D.
# An equivalent mask is np.logical_not(np.isnan(new_a)).
new_data = new_a[np.isfinite(new_a)]
new_data

array([1., 2., 4., 5., 6.])
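Filtering works, but it flattens the array. As a minimal sketch of an alternative, NumPy also offers NaN-aware reductions such as `np.nanmean`, which skip the missing entries and keep the array shape intact. The variable names below are only illustrative.

```python
import numpy as np

a = np.array([[1, 2, np.nan], [4, 5, 6]])

# np.mean propagates NaN, so the result is nan
plain_mean = np.mean(a)

# np.nanmean ignores NaN entries: (1 + 2 + 4 + 5 + 6) / 5 = 3.6
nan_aware_mean = np.nanmean(a)

# Column-wise mean, still skipping NaNs: [2.5, 3.5, 6.0]
column_means = np.nanmean(a, axis=0)

print(plain_mean, nan_aware_mean, column_means)
```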

**Step A2: NaNs in Pandas?**

Let's create a sample DataFrame with missing values to work with.

In [13]:
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, np.nan, 10],
    'C': [11, 12, 13, 14, 15]
}

df = pd.DataFrame(data)

**Step A3: Identifying Missing Values**

To identify missing values in your DataFrame, you can use the `isna()` method or its alias `isnull()`. Both return a DataFrame of the same shape filled with Booleans: `True` where a value is missing (NaN) and `False` where it is not.

In [14]:
# Your code here
missing_values = df.isna()
print(missing_values)

       A      B      C
0  False  False  False
1  False   True  False
2   True  False  False
3  False   True  False
4  False  False  False


**Step A4: Counting Missing Values**

To count the missing values in each column, you can use the `sum()` function on the DataFrame returned from `isna()`.

In [15]:
# Your code here
missing_count = df.isna().sum()
print(missing_count)

A    1
B    2
C    0
dtype: int64


**Step A5: Removing Rows with Missing Values**

You can remove rows with missing values using the `dropna()` function.

In [16]:
# Your code here
df_cleaned = df.dropna()
print(df_cleaned)

     A     B   C
0  1.0   6.0  11
4  5.0  10.0  15


This will remove all rows that contain at least one missing value. If you want to remove columns with missing values, you can use `df.dropna(axis=1)`.
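As a small sketch of the column-wise variant, the example below rebuilds the same sample DataFrame and drops incomplete columns. It also shows the `subset` argument of `dropna()`, which is not covered above and is included here only as an optional extra.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, np.nan, 10],
    'C': [11, 12, 13, 14, 15],
})

# Drop every column that contains at least one NaN; only 'C' is complete
df_complete_columns = df.dropna(axis=1)

# Drop rows only when column 'A' is missing, ignoring gaps in 'B'
df_rows_with_a = df.dropna(subset=['A'])

print(df_complete_columns)
print(df_rows_with_a)
```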

**Step A6: Filling Missing Values**

To fill missing values, you can use the `fillna()` function. You can replace NaNs with a specific value or with the mean, median, or any other statistic.

In [17]:
df_filled = df.fillna(0)  # Replace NaNs with 0
print(df_filled)

     A     B   C
0  1.0   6.0  11
1  2.0   0.0  12
2  0.0   8.0  13
3  4.0   0.0  14
4  5.0  10.0  15


In [18]:
df.mean()
# Unlike np.mean, DataFrame.mean() skips NaNs by default (skipna=True)

A     3.0
B     8.0
C    13.0
dtype: float64

In [19]:
# Your code here
df_mean_filled = df.fillna(df.mean())
print(df_mean_filled)

     A     B   C
0  1.0   6.0  11
1  2.0   8.0  12
2  3.0   8.0  13
3  4.0   8.0  14
4  5.0  10.0  15


**Step A7: Interpolation**

Pandas provides interpolation methods for filling missing values. For example, to perform linear interpolation, you can use:

In [20]:
df_interpolated = df.interpolate()
print(df_interpolated)

     A     B   C
0  1.0   6.0  11
1  2.0   7.0  12
2  3.0   8.0  13
3  4.0   9.0  14
4  5.0  10.0  15


That's it! You've now learned how to deal with missing values in Python using Pandas and NumPy, including identifying, counting, removing, filling, and interpolating missing values. Apply it to your datasets.

In more complex scenarios, you might want to use more advanced imputation techniques. For example, you can use scikit-learn's SimpleImputer to fill missing values with the mean, median, or a custom strategy.

**Common Pitfall:** When to use `inplace=True`

Many Pandas methods accept an `inplace=True` argument. An in-place call modifies the data object directly, while a "normal" call leaves the original untouched and returns a new object.

It's important to use `inplace=True` with care because it modifies the original object, and you cannot easily revert the changes. Make sure you have a backup of your data or a clear understanding of the consequences before using `inplace=True`. Additionally, in some cases, it can be better to create a new DataFrame or Series and assign the modified result to it instead of using `inplace=True`.
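The difference can be seen in a short sketch. The DataFrame and the names `df_filled` and `df_backup` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0]})

# "Normal" call: the result is a new DataFrame, df itself still has its NaN
df_filled = df.fillna(0)

# In-place call, done on a copy so the original data stays available
df_backup = df.copy()
df_backup.fillna(0, inplace=True)

print(df)
print(df_filled)
print(df_backup)
```

Assigning the result to a new name, as with `df_filled`, keeps the original around and makes it easy to compare before and after.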

**Step B1: Create a Sample DataFrame**

Let's create a sample DataFrame with a categorical variable that we will encode and decode.

In [21]:
data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
}

df = pd.DataFrame(data)

# Encode the 'Category' column
df['Encoded_Category'] = df['Category'].map({'A': 0, 'B': 1, 'C': 2})

# Create a valid dictionary for decoding
category_dict = {0: 'A', 1: 'B', 2: 'C'}

# Decode the 'Encoded_Category' column
df['Decoded_Category'] = df['Encoded_Category'].map(category_dict)

print(df)

  Category  Encoded_Category Decoded_Category
0        A                 0                A
1        B                 1                B
2        A                 0                A
3        C                 2                C
4        B                 1                B
5        C                 2                C
6        A                 0                A


**Step B2: One-hot Encoding**

To encode categorical variables, you can use one-hot encoding. Pandas provides a convenient method called `pd.get_dummies()` for this purpose.

In [None]:
data = {
    'Category': ['A', 'B', 'C', 'A', 'B', 'C']
}
df = pd.DataFrame(data)
# Perform one-hot encoding

This will convert the categorical variable 'Category' into binary columns for each category.
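One possible way to fill in the encoding step is sketched below. It assumes you want to keep the original column next to the new indicator columns and uses `dtype=int` so the indicators show as 0/1; both choices are optional.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C']})

# One indicator column per category, named with the 'Category_' prefix
one_hot = pd.get_dummies(df['Category'], prefix='Category', dtype=int)

# Keep the original column alongside the new indicator columns
df_encoded = pd.concat([df, one_hot], axis=1)

print(df_encoded)
```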

Now, let's say you want to convert the one-hot encoded columns back to the original categorical variable.

**Step B3: Decoding the DataFrame**

To decode the DataFrame, use `idxmax(axis=1)` on the one-hot encoded columns to find, for each row, the column holding the highest value (1). Then use `str.replace()` to remove the "Category_" prefix from the resulting column names. Finally, store the decoded values in a new 'Decoded_Category' column.

In [None]:
# Create a decoding function to map the one-hot encoded columns back to categories

# Decode the one-hot encoded columns and remove the "Category_" prefix

# Create a new 'Decoded_Category' column

print(df)
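A minimal sketch of the decoding steps described above, assuming the indicator columns were created with `pd.get_dummies(..., prefix='Category')`. The intermediate name `winning_column` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C']})
one_hot = pd.get_dummies(df['Category'], prefix='Category', dtype=int)

# For each row, idxmax(axis=1) returns the name of the column holding the 1
winning_column = one_hot.idxmax(axis=1)

# Strip the prefix to recover the original labels
decoded = winning_column.str.replace('Category_', '', regex=False)

# Store the decoded labels in a new column
df['Decoded_Category'] = decoded

print(df)
```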

Explore other encoding options...

- Frequency (Count) Encoding: This method replaces categories with the frequency (count) of each category in the dataset. It can be useful when the frequency of categories is relevant information.

- Target Encoding (Mean Encoding): In target encoding, the categories are replaced with the mean of the target variable for each category. It's often used in classification tasks.

**Step B4: Frequency (Count) Encoding:**

Let's say you have a DataFrame with a categorical column `Color`, and you want to perform frequency encoding on it:

In [None]:
data = {
    'Color': ['Red', 'Blue', 'Red', 'Green', 'Blue', 'Green', 'Red', 'Red']
}

df = pd.DataFrame(data)

# Calculate the frequency of each category and create a mapping
frequency_map = df['Color'].value_counts().to_dict()

# Apply frequency encoding to the 'Color' column
df['Color_Frequency'] = df['Color'].map(frequency_map)

print(df)

In this example, the 'Color_Frequency' column will contain the count (frequency) of each color in the 'Color' column.
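Target (mean) encoding, listed among the other encoding options, follows the same map-based pattern. The sketch below uses made-up data; the `Bought` column is a hypothetical binary target that does not appear anywhere else in this tutorial.

```python
import pandas as pd

# Hypothetical data: 'Bought' is a made-up binary target
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Red', 'Green', 'Blue', 'Red'],
    'Bought': [1, 0, 1, 0, 1, 0],
})

# Mean of the target within each category
target_means = df.groupby('Color')['Bought'].mean()

# Replace each category with its target mean
df['Color_Target'] = df['Color'].map(target_means)

print(df)
```

In a modelling workflow the means would normally be computed on the training data only, since using the target of the rows you later predict leaks information.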


**Step C1: Grouping Data**

You can use the `groupby` function to group the data based on a specific column. For example, let's group the data by the 'Category' column:

In [None]:
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'A'],
    'Value': [10, 20, 15, 25, 12, 18]
}

df = pd.DataFrame(data)
grouped = df.groupby('Category')

**Step C2: Aggregating Data**

Once you have grouped the data, you can perform various aggregation operations. Some common aggregation methods include `sum()`, `mean()`, `max()`, `min()`, and `count()`. For instance, to find the sum of 'Value' within each category:

In [None]:
# Your code here
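One possible solution is sketched below. It rebuilds the sample data so that it runs on its own; the name `sum_result` is just a suggestion.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'A'],
    'Value': [10, 20, 15, 25, 12, 18],
})
grouped = df.groupby('Category')

# Sum of 'Value' within each category: A -> 55, B -> 45
sum_result = grouped['Value'].sum()

print(sum_result)
```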

**Step C3: Multiple Aggregations**

You can perform multiple aggregations at once using the `agg` method. For instance, to find both the sum and mean of 'Value' by category:

In [None]:
# Your code here
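As a sketch of one way to do this, pass a list of aggregation names to `agg`. The result has one column per aggregation, and the name `agg_result` is an assumption chosen to match the next step.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'A'],
    'Value': [10, 20, 15, 25, 12, 18],
})

# Sum and mean of 'Value' per category, returned as two columns
agg_result = df.groupby('Category')['Value'].agg(['sum', 'mean'])

print(agg_result)
```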

**Step C4: Resetting the Index**

By default, the grouped column becomes the index. Reset the index and keep the results as columns:

In [None]:
# make use of .reset_index()
agg_result
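A self-contained sketch of this step follows. It assumes `agg_result` holds the sum and mean per category; the `as_index=False` variant at the end is an alternative not mentioned above that avoids the extra call.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'A'],
    'Value': [10, 20, 15, 25, 12, 18],
})

agg_result = df.groupby('Category')['Value'].agg(['sum', 'mean'])

# Move 'Category' from the index back into an ordinary column
agg_result = agg_result.reset_index()

# Alternative: ask groupby not to use the group key as the index at all
sum_as_columns = df.groupby('Category', as_index=False)['Value'].sum()

print(agg_result)
print(sum_as_columns)
```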

Also, practice grouping by multiple columns on a dataset with more columns!
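To get started, here is a small sketch of grouping by two columns. The `Region` column and its values are invented for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'A'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],  # hypothetical column
    'Value': [10, 20, 15, 25, 12, 18],
})

# Pass a list of column names to group by every combination of them
by_category_region = (
    df.groupby(['Category', 'Region'])['Value']
      .sum()
      .reset_index()
)

print(by_category_region)
```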

**Step D1: Wide vs Long Formats**

Let's practice with the Halloween data you used for the Tableau practical. We preferred to transform the file outside Tableau, and here we will learn how to do that in Python. It is also possible to do it in Tableau by pivoting the table.

In [None]:
!wget -O week6_data.zip "https://drive.google.com/uc?export=download&id=1_cmPeepnFq4EDAvlnsFsFR9pNKiCo-jo"
!unzip week6_data.zip

In [None]:
import pandas as pd
df_wide = pd.read_csv('trick_or_treat.csv')
df_wide

**Step D2: Converting from Wide to Long Format**

To convert from wide to long format, you can use the `melt` function. You need to specify which columns to keep as identifier variables (e.g., 'Year') and which columns to melt into a new variable. In this example, we will melt the timeslot columns into a 'Time' column:

In [None]:
df_long = pd.melt(df_wide, id_vars=['Year'], var_name='Time', value_name='Counts')
df_long

The resulting DataFrame (**`df_long`**) will be in long format, where each row represents a unique combination of 'Year' and 'Time'.

**Step D3: Converting from Long to Wide Format**

To convert data back from long to wide format, you can use the `pivot` function. This requires specifying which columns to use as the index, columns, and values.

In [None]:
df_wide_back = df_long.pivot(index='Year', columns='Time', values='Counts').reset_index()
df_wide_back

In [None]:
# sorting rows and columns?


The `pivot` function reshapes the data back into wide format using 'Year' as the index, 'Time' as the columns, and 'Counts' as the values. The `reset_index()` function is used to restore 'Year' as a column.
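The full round trip, including one way to restore the row and column order afterwards, can be sketched on a tiny stand-in table. The column names and counts below are invented; the real `trick_or_treat.csv` may look different.

```python
import pandas as pd

# Hypothetical stand-in for the trick-or-treat file; real column names may differ
df_wide = pd.DataFrame({
    'Year': [2009, 2008],
    '6:00pm': [13, 9],
    '6:30pm': [21, 17],
})

# Wide -> long: one row per (Year, Time) pair
df_long = pd.melt(df_wide, id_vars=['Year'], var_name='Time', value_name='Counts')

# Long -> wide again
df_wide_back = df_long.pivot(index='Year', columns='Time', values='Counts').reset_index()

# pivot may return rows and columns in a different order, so restore both explicitly
df_wide_back = df_wide_back[list(df_wide.columns)]   # original column order
df_wide_back.columns.name = None                     # drop the leftover 'Time' label
df_wide_back = df_wide_back.sort_values('Year', ascending=False).reset_index(drop=True)

print(df_long)
print(df_wide_back)
```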

**Final Remarks: Input / Output**

Data can be read into Pandas in several ways. For example, `pd.read_csv` will read in a CSV file and create a `DataFrame`.

Getting data out again can also be done in several ways: `df.to_numpy()` to create NumPy arrays or `df.to_csv("filename.csv")` to write out to a CSV file.
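A short round-trip sketch of these calls follows. To avoid writing a file it keeps the CSV text in memory with `io.StringIO`, which is a choice made for this example; passing a filename works the same way. The data is made up.

```python
import io

import pandas as pd

df = pd.DataFrame({'Year': [2008, 2009], 'Counts': [9, 13]})

# With no path given, to_csv returns the CSV text; index=False omits the row labels
csv_text = df.to_csv(index=False)

# read_csv accepts a path or any file-like object
df_roundtrip = pd.read_csv(io.StringIO(csv_text))

# Plain NumPy array holding the same values
values = df.to_numpy()

print(csv_text)
print(df_roundtrip)
print(values)
```

Without `index=False`, the row labels are written as an extra unnamed column, which then shows up when the file is read back in.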