# --------------------------------------- Training -------------------------------------

* The comments explain what each cell does. Read the comments first, and then run the cells to see the outputs.


* There are a few exercises in the notebook, and solutions are given for some of them. Please try to complete each exercise yourself before looking at its solution.


* After running all cells and completing all exercises, complete the tasks at the end of the notebook. 

## Importing necessary Libraries  

In [None]:
import pandas as pd 
import numpy as np

## Importing a dataset for this workshop

## Importing the dataset directly from the source

In [None]:
# Installing the package needed to download the dataset from its online source. This package is recommended by the online source at the following
# URL: https://archive.ics.uci.edu/dataset/2/adult. You need to restart the kernel after the package has been installed
!pip3 install ucimlrepo

In [None]:
# Downloading the dataset from the online source. The first two lines are given by the online source mentioned above
from ucimlrepo import fetch_ucirepo 
# fetch dataset 
adult = fetch_ucirepo(id=2) 

# Putting data in a pandas dataframe
X = adult.data.features 
y = adult.data.targets 
data=pd.concat([X,y],axis=1)


In [None]:
#printing data
data

## Importing the dataset from a folder on your local disk 

In [None]:
#importing the dataset as a Pandas DataFrame into Python if the dataset is stored on your local hard disk
# You can download the dataset from the following URL: https://archive.ics.uci.edu/dataset/2/adult  
data = pd.read_csv('adult.csv') # Replace the current data path with the data path on your local disk 

## Exploring the dataset

In [None]:
# Showing the first 5 rows of the dataset
data.head()

In [None]:
# Finding the shape of the data
print(data.shape)

In [None]:
# Generate a dataset by randomly extracting 30000 rows (samples) 
data_new = data.sample(n=30000, random_state = 48)

In [None]:
# Printing the new dataset
data_new

In [None]:
# After random sampling, the rows keep their original index labels, so the index is no longer 0, 1, 2, ... This happens in many
# data science projects. Always reindex the dataset if you are unsure the indices are correct.
data_new.reset_index(drop=True, inplace=True)
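
The effect of reindexing is easier to see on a small example. The sketch below uses a made-up DataFrame called `toy` (an assumption for illustration only, not part of the Adult dataset) to show that sampling keeps the original row labels and that `reset_index(drop=True)` replaces them with 0, 1, 2, ...

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"age": [25, 38, 52, 41, 19]})

# Sampling keeps the original row labels, so they are generally not 0, 1, 2 any more
toy_sample = toy.sample(n=3, random_state=48)
print(toy_sample.index.tolist())

# drop=True discards the old labels instead of storing them in a new column
toy_sample = toy_sample.reset_index(drop=True)
print(toy_sample.index.tolist())
```

Assigning the result back, as done here, is an alternative to `inplace=True`; both leave the DataFrame with a clean index.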

In [None]:
# Checking if the indices are correct
data_new

In [None]:
data

In [None]:
# Getting statistical information of the dataset for different columns (features) 
data.describe(include="all")

In [None]:
# Showing the dataset information
data.info()

In [None]:
# Getting the count of different values in the column "education-num"
data['education-num'].value_counts()

In [None]:
# Getting the count of different values in the column "education"
data['education'].value_counts()

In [None]:
# Dropping a column
data = data.drop(['fnlwgt'], axis=1)

In [None]:
data.shape

In [None]:
# Getting the number of unique values of a column
data['education'].nunique()

In [None]:
# Counting how many rows belong to each sex
data['sex'].value_counts()

In [None]:
# Calculating the average age of different genders in the dataset 
data['age'].groupby([data['sex']]).mean()


In [None]:
# Getting the average age of different genders in the dataset broken down based on their education
data['age'].groupby([data['sex'],data['education']]).mean()

In [None]:
# Getting the maximum age of different races in the dataset
data['age'].groupby([data['race']]).max() 
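
The grouping cells above can also be written by grouping the DataFrame first and then selecting the column, which many people find easier to read. The following is a minimal sketch on a made-up DataFrame (`toy` and its values are assumptions for illustration, not rows from the Adult dataset).

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "race": ["White", "Black", "White", "Black"],
    "age": [30, 40, 50, 20],
})

# Average age per sex: group by one column, then select the column to aggregate
mean_age_by_sex = toy.groupby("sex")["age"].mean()

# Maximum age per race
max_age_by_race = toy.groupby("race")["age"].max()

# Average age per (sex, race) pair: pass a list of column names to group by several keys
mean_age_by_sex_race = toy.groupby(["sex", "race"])["age"].mean()

print(mean_age_by_sex)
print(max_age_by_race)
print(mean_age_by_sex_race)
```

Both styles should give the same numbers; the difference is only in how the grouping keys are passed.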

* Exercise 1: Write code to find the average capital-gain for each sex and occupation category. Apply the code to the dataset and print the result.

In [None]:
# Extracting the age and education columns and creating a new DataFrame using these columns
a=data['age']
b=data['education']
new_data=pd.concat([a,b],axis=1)
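
Selecting several columns with a list of names is a shorter way to build the same kind of DataFrame. The sketch below compares the two approaches on a made-up DataFrame (`toy` is an assumption for illustration only).

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "age": [39, 50],
    "education": ["Bachelors", "HS-grad"],
    "sex": ["Male", "Female"],
})

# Approach used above: extract each column and concatenate them side by side
via_concat = pd.concat([toy["age"], toy["education"]], axis=1)

# Alternative: index the DataFrame with a list of column names
via_selection = toy[["age", "education"]]

# Check that both approaches produce the same DataFrame
print(via_concat.equals(via_selection))
```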

In [None]:
data

* Exercise 2: Write a function that receives the dataset and replaces Female with F and Male with M (please try to write it yourself before checking the answer in the next cell)

In [None]:
# exercise 2 solution
def encode_sex(data):
    data.reset_index(drop=True, inplace=True) # reindexing the dataset in case the dataset index is corrupted
    rows=data.shape[0]
    a=data.loc[:,'sex']
    for i in range(rows): 
        if a[i]=="Male":
            data.loc[i,"sex"]="M"
        elif a[i]=="Female":
            data.loc[i,"sex"]="F"
    return data
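
The loop-based solution above works, but it visits the rows one by one. As an alternative sketch (not the workshop's official solution), the same replacement can be done for the whole column at once with `Series.replace`. The function name `encode_sex_vectorized` and the `toy` DataFrame are hypothetical and introduced only for this example. Unlike `encode_sex`, this version works on a copy and leaves its input unchanged.

```python
import pandas as pd

def encode_sex_vectorized(df):
    """Return a copy of df with 'Male'/'Female' in the 'sex' column replaced by 'M'/'F'."""
    out = df.copy()  # work on a copy so the caller's DataFrame is not modified
    out["sex"] = out["sex"].replace({"Male": "M", "Female": "F"})
    return out

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 50, 38]})

encoded = encode_sex_vectorized(toy)
print(encoded)
print(toy)  # the original toy DataFrame still contains 'Male' and 'Female'
```

Values other than "Male" and "Female" are left as they are, which matches the behaviour of the loop-based version.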

In [None]:
# Copying the data
data_copy=data.copy()

In [None]:
# Applying the encode_sex function to the copied data
data_encoded=encode_sex(data_copy)
data_encoded.head()

In [None]:
data

# Tasks

## P 2.1: Complete the following steps (4%): 
## 1- Import the dataset from the URL we used in this workshop. 
## 2- Generate a new dataset by randomly extracting 10000 samples. 
## 3- Drop the 'income' column, reindex the new dataset and then clean it. 
## 4- Print the new dataset and use it for the rest of the tasks

In [None]:
############# WRITE YOUR CODE IN THIS CELL (IF APPLICABLE)  ####################




## P 2.2: Complete the following steps (4%):
## 1- Determine which columns are categorical and which columns are numerical. 
## 2- Encode the categorical columns using the correct method. 
## 3- Normalise the numerical columns. 
## 4- Print the dataset 

In [None]:
############# WRITE YOUR CODE IN THIS CELL (IF APPLICABLE)  ####################


## P 2.3: Write a function to split the dataset in half column-wise and swap the first half and the second half. Apply the function to the dataset and print the result (3%)

In [None]:
############# WRITE YOUR CODE IN THIS CELL (IF APPLICABLE)  ####################




## P 2.4: Write a function that receives the names of two numerical columns and compares their values for all rows. If the value of the first column is greater than the value of the second column, the function should produce True; otherwise, it should produce False. The function should append an additional column to the dataset to store the results of the comparison for all rows. Apply the function to the "age" and "hours-per-week" columns in the dataset and print the result (4%).

In [None]:
############# WRITE YOUR CODE IN THIS CELL (IF APPLICABLE)  ####################


