<a href="https://colab.research.google.com/github/JBtallgrass/UPENN_GSE_collab/blob/main/DS_Methods_Feature_Extraction_and_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DS Methods - Feature Extraction and Feature Engineering Notebook

### Use the summary dataset to investigate how a conversational agent's language style (Condition), formal or informal, affects students' language usage (formality) before and after a 3-hour intervention (Test). The relevant paper is located in the Resources folder within this module.

## Task 1. Data Cleaning

### 1.1 Install and import the packages
#### Install any missing packages first, using "!pip install package"

In [None]:
# !pip install pandas

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import os

#### Check the current working directory and set options to show all the columns in the data

In [None]:
print(os.getcwd())  # print explicitly: only the last expression of a cell is displayed automatically
pd.set_option('display.max_columns', None)

### 1.2 Upload the data and clean the data

In [None]:
from google.colab import files
dataset = files.upload()
filename = list(dataset.keys())[0]
print(f"{filename} has been uploaded")

Saving summary.csv to summary.csv
summary.csv has been uploaded


In [None]:
df = pd.read_csv("summary.csv")

#### 1.2.1 Review the data [e.g., df.head()] and retrieve the names of all the columns [e.g., df.columns.tolist()]

In [None]:
df.head(5)

Unnamed: 0,UserID,LessonID,SumRT,Condition,Age,Sex,Ethnicity,BirthCountry,StayContry,EngYear,Howlong,Edu,formality,Test
0,FORMAL1_01,lesson10,262.812,Formal,30,2,5,India,India,15,0.0,8,0.084,PreTest
1,FORMAL1_01,lesson2,122.421,Formal,30,2,5,India,India,15,0.0,8,-3.948,PostTest
2,FORMAL1_01,lesson3,125.968,Formal,30,2,5,India,India,15,0.0,8,0.567,PostTest
3,FORMAL1_01,lesson5,111.0,Formal,30,2,5,India,India,15,0.0,8,-1.263,Training
4,FORMAL1_01,lesson6,99.437,Formal,30,2,5,India,India,15,0.0,8,-1.79,Training


In [None]:
df.columns.tolist()

['UserID',
 'LessonID',
 'SumRT',
 'Condition',
 'Age',
 'Sex',
 'Ethnicity',
 'BirthCountry',
 'StayContry',
 'EngYear',
 'Howlong',
 'Edu',
 'formality',
 'Test']

#### 1.2.2 Clean the data
##### (a) Keep only formal and informal conditions based on the Condition column and check the unique values in the Condition column
##### (b) Keep only pretest and posttest based on the Test column and check the unique values in the Test column

In [None]:
# Keep only the formal and informal conditions (Condition column) and the pretest/posttest rows (Test column)
def filter_data(df, condition_column, test_column):
    filtered_df = df[(df[condition_column] != 'Mixed') & (df[test_column] != 'Training')]
    return filtered_df

df_con = filter_data(df, 'Condition', 'Test')
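The filter above works by excluding the `Mixed` and `Training` levels. An alternative is to list the levels to keep, which stays correct if other unexpected levels turn up in the data. The sketch below uses a made-up toy frame; the helper name `keep_levels` and the exact spelling of the `Informal` level are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the summary data (rows are made up for illustration)
toy = pd.DataFrame({
    "Condition": ["Formal", "Informal", "Mixed", "Formal", "Informal"],
    "Test": ["PreTest", "PostTest", "PreTest", "Training", "PreTest"],
})

def keep_levels(data, condition_column, test_column):
    """Hypothetical helper: keep rows whose levels are in explicit allow-lists."""
    mask = (
        data[condition_column].isin(["Formal", "Informal"])  # "Informal" spelling is assumed
        & data[test_column].isin(["PreTest", "PostTest"])
    )
    # .copy() returns an independent frame, so new columns can be added to it later
    return data[mask].copy()

toy_con = keep_levels(toy, "Condition", "Test")
print(toy_con["Condition"].value_counts())
print(toy_con["Test"].value_counts())
```

Calling `value_counts()` on both columns afterwards is a quick way to confirm that only the intended levels remain.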




In [None]:
# Check the unique values using this code: df['Your_Column_Name'].value_counts()
## Unique values in the Condition column of the filtered data df_con
df_con['Condition'].value_counts()


In [None]:
## Unique values in the Test column of the filtered data df_con
df_con['Test'].value_counts()


## Task 2. Binning
### 2.1 Visualize age using matplotlib

In [None]:
# Plot the distribution using: plt.hist(df['Your_Column_Name'])
# Here the data is df_con and the column name is "Age".
plt.hist(df_con['Age'])


# Show the plot
plt.show()
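A histogram is easier to read when its bin edges match the age groups of interest and its axes are labeled. The following is a minimal sketch on made-up ages; the bin edges and axis labels are illustrative choices, not part of the assignment.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages standing in for df_con['Age']
ages = pd.Series([22, 25, 30, 31, 38, 40, 45, 52], name="Age")

fig, ax = plt.subplots()
# Explicit bin edges: each bar covers one decade
counts, edges, _ = ax.hist(ages, bins=[20, 30, 40, 50, 60], edgecolor="black")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of rows")
ax.set_title("Distribution of Age")
plt.show()
```

Note that matplotlib treats every bin except the last as closed on the left and open on the right, so an age of exactly 30 is counted in the 30 to 40 bar.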


### 2.2 Create a new categorical variable by grouping Age into 21-30, 31-40, and above 40

In [None]:
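# Create a new categorical variable for age groups.
# With the default right=True the bins are (20, 30], (30, 40], and (40, inf],
# so an age of exactly 30 is labeled '21-30' and an age of exactly 40 is labeled '31-40'.
df_con['age_group'] = pd.cut(df_con['Age'], bins=[20, 30, 40, float('inf')], labels=['21-30', '31-40', 'above 40'])

# Check the unique values of the new age group using this code: df['Your_Column_Name'].value_counts()
df_con['age_group'].value_counts()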



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_con['age_group'] = pd.cut(df_con['Age'], bins=[20, 30, 40, float('inf')], labels=['21-30', '31-40', 'above 40'], right=False)
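With `pd.cut`, the `right` argument decides which bin a boundary age such as 30 or 40 falls into, so it is worth checking that the labels match the intervals. The sketch below compares both settings on a handful of made-up boundary ages.

```python
import pandas as pd

ages = pd.Series([21, 30, 31, 40, 41])  # made-up boundary ages
bins = [20, 30, 40, float("inf")]
labels = ["21-30", "31-40", "above 40"]

# Default right=True: intervals are (20, 30], (30, 40], (40, inf]
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [20, 30), [30, 40), [40, inf)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame(
    {"Age": ages, "right=True": right_closed, "right=False": left_closed}
)
print(comparison)
```

Only the right-closed version agrees with labels such as "21-30"; with `right=False`, an age of 30 is placed in the "31-40" group.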


## Task 3. Transformation
### 3.1 Visualize summary writing time (SumRT)

In [None]:
# Plot the distribution using: plt.hist(df['Your_Column_Name'])
plt.hist(df_con['SumRT'])
plt.show()



### 3.2 Create a new continuous variable by applying a log transformation, and visualize the transformed variable

In [None]:
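# First, ensure there are no non-positive values in 'SumRT', as log transformation requires positive values
print((df_con['SumRT'] <= 0).sum())  # number of rows with SumRT <= 0; should be 0 before taking logs

# Apply log transformation to 'SumRT' column
df_con['SumRT_log'] = np.log(df_con['SumRT'])

# Visualize the transformed variable
plt.hist(df_con['SumRT_log'])
plt.show()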


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_con['SumRT_log'] = np.log(df_con['SumRT'])
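The warning shown above appears because `df_con` was created by filtering `df`, so pandas cannot tell whether the new column is being written to a view or to a copy. One common way to avoid it is to take an explicit `.copy()` when filtering. The sketch below combines that with the positivity check that a log transformation needs; the toy frame reuses a few `SumRT` values from the data preview, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Test": ["PreTest", "Training", "PostTest", "PreTest"],
    "SumRT": [262.812, 111.0, 122.421, 2686.001],  # values taken from the preview rows
})

# Boolean indexing followed by .copy() gives an independent frame, so the
# column assignment below does not trigger SettingWithCopyWarning.
subset = toy[toy["Test"] != "Training"].copy()

# np.log is only defined for positive inputs, so count offending rows first.
n_non_positive = int((subset["SumRT"] <= 0).sum())
print("non-positive SumRT rows:", n_non_positive)

subset["SumRT_log"] = np.log(subset["SumRT"])
print(subset)
```

If the count were greater than zero, those rows would need to be dropped or handled separately before the transformation.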


## Task 4. Encoding
### 4.1 How many levels does the highest education level (Edu) have?

In [None]:
# Check the unique values of Edu using this code: df['Your_Column_Name'].value_counts()
df_con['Edu'].value_counts()



### 4.2 Encode this variable using a suitable encoding method.
#### The integer in the Edu column represents different levels of highest education:
#### 3:'Some high school', 4:'High School', 5:'Some college', 7:'Associate', 8:'Bachelor', 9:'Master', 10:'Professional', 11:'Doctoral'

In [None]:
replace_edu = {3:'Some high school', 4:'High School', 5:'Some college', 7:'Associate',
               8:'Bachelor', 9:'Master', 10:'Professional', 11:'Doctoral'}
df_con['Edu_cat'] = df_con['Edu'].replace(replace_edu)


# Check the unique values of the new Edu_cat variable using this code: df['Your_Column_Name'].value_counts()

df_con['Edu_cat'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_con['Edu_cat'] = df_con['Edu'].replace(replace_edu)


Bachelor            204
Master              128
Doctoral             40
Some college         30
Professional         24
High School          20
Associate            20
Some high school     12
Name: Edu_cat, dtype: int64
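Because education levels have a natural order, one option besides dummy coding is an ordinal encoding built on an ordered categorical. The sketch below is a minimal illustration on made-up codes; ordering the levels by their original integer code is an assumption, and `Edu_ordinal` is a hypothetical column name.

```python
import pandas as pd

edu_labels = {3: "Some high school", 4: "High School", 5: "Some college", 7: "Associate",
              8: "Bachelor", 9: "Master", 10: "Professional", 11: "Doctoral"}

edu = pd.Series([8, 9, 3, 11, 5], name="Edu")  # made-up sample of Edu codes

# Order the levels by their original integer code (assumed to reflect rank)
ordered_levels = [edu_labels[code] for code in sorted(edu_labels)]
edu_cat = pd.Categorical(edu.map(edu_labels), categories=ordered_levels, ordered=True)

# Rank-based ordinal codes: 0 for the lowest level, 7 for the highest
edu_ordinal = pd.Series(edu_cat.codes, index=edu.index, name="Edu_ordinal")
print(pd.DataFrame({"Edu": edu, "Edu_cat": edu_cat, "Edu_ordinal": edu_ordinal}))
```

Unlike the raw codes, the rank-based codes have no gaps (the raw scale skips 6), though either version treats the distance between adjacent levels as equal, which may not hold.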

### 4.3 Concatenate the coded variable with the original data

In [None]:
## Create a dummy coded variable
dummy_coded = pd.get_dummies(df_con['Edu_cat'], prefix='Edu')


# Merge the dummy coded variable to the dataset
df_con = pd.concat([df_con, dummy_coded], axis=1)
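Two practical details are worth keeping in mind with this pattern. Since pandas 2.0, `pd.get_dummies` returns boolean columns by default, so `dtype=int` is needed to get 0/1 values, and re-running a `pd.concat` cell appends the same dummy columns a second time. The following sketch, on a made-up toy frame, shows one way to handle both.

```python
import pandas as pd

toy = pd.DataFrame({"Edu_cat": ["Bachelor", "Master", "Bachelor", "Doctoral"]})  # made-up rows

# dtype=int yields 0/1 columns regardless of the pandas version
dummies = pd.get_dummies(toy["Edu_cat"], prefix="Edu", dtype=int)

# Drop any dummy columns left over from an earlier run before concatenating,
# so re-running the cell does not create duplicate columns
toy = pd.concat([toy.drop(columns=dummies.columns, errors="ignore"), dummies], axis=1)
print(toy)
```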



### 4.4 Address the prevalence of zeroes in the generated variables by using appropriate techniques.
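One way to gauge the problem is to compute the share of zeroes in each dummy column, then compare it with the share after merging rare levels into broader groups. The sketch below does this on made-up `Edu` codes; the three-group mapping is just one possible grouping, and the variable names are illustrative.

```python
import pandas as pd

edu = pd.Series([8, 8, 9, 3, 11, 8, 5, 9], name="Edu")  # made-up codes

# One dummy column per observed level
fine = pd.get_dummies(edu, prefix="Edu", dtype=int)

# Merge levels into three broader groups before dummy coding
collapse = {3: "No Bachelor", 4: "No Bachelor", 5: "No Bachelor", 7: "No Bachelor",
            8: "Bachelor", 9: "Master or higher", 10: "Master or higher", 11: "Master or higher"}
coarse = pd.get_dummies(edu.map(collapse), prefix="Edu_New", dtype=int)

# Share of zeroes per column, before and after merging
zero_share_fine = (fine == 0).mean()
zero_share_coarse = (coarse == 0).mean()
print(zero_share_fine)
print(zero_share_coarse)
```

Fewer, broader categories mean fewer columns and a lower average share of zeroes, at the cost of losing distinctions between the merged levels.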

In [None]:
## Bonus Creative Assignments
### 1. Use appropriate methods to handle the following variables:
#### (1) Ethnicity {1:'White', 2:'Hispanic', 3:'Black', 4:'Native American', 5:'Asian', 6:'Pacific Islander', 7:'Other'}
#### (2) Birth country (BirthCountry)
#### (3) The country where participants stay during the experiment (StayContry, as the column is spelled in the data)
#### (4) The years they learned English as a foreign or second language (EngYear)
#### (5) The duration of their stay in the country (Howlong)
### 2. Which features robustly predict students' use of formality?
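As a starting point for items (1) and (2), integer codes can be mapped to readable labels, and high-cardinality columns such as countries can be simplified by merging rare values into a single "Other" level. The sketch below uses made-up rows; the `min_count` threshold and the new column names are hypothetical choices.

```python
import pandas as pd

ethnicity_labels = {1: "White", 2: "Hispanic", 3: "Black", 4: "Native American",
                    5: "Asian", 6: "Pacific Islander", 7: "Other"}

toy = pd.DataFrame({
    "Ethnicity": [5, 1, 1, 3, 5, 2],
    "BirthCountry": ["India", "USA", "USA", "Kenya", "India", "Peru"],  # made-up rows
})

# (1) Replace integer codes with readable labels
toy["Ethnicity_cat"] = toy["Ethnicity"].map(ethnicity_labels)

# (2) Merge countries that appear fewer than min_count times into "Other"
min_count = 2  # hypothetical threshold
country_counts = toy["BirthCountry"].value_counts()
frequent = country_counts[country_counts >= min_count].index
toy["BirthCountry_grouped"] = toy["BirthCountry"].where(
    toy["BirthCountry"].isin(frequent), "Other"
)
print(toy)
```

The grouped column can then be dummy coded in the same way as `Edu_cat`, with far fewer mostly-zero columns than coding every country separately.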

In [None]:
# Create a new, coarser education variable as an example
replace_edu_new = {3:'No Bachelor', 4:'No Bachelor', 5:'No Bachelor', 7:'No Bachelor',
               8:'Bachelor', 9:'Master or higher', 10:'Master or higher', 11:'Master or higher'}
df_con['Edu_new'] = df_con['Edu'].replace(replace_edu_new)


dummy_coded_new = pd.get_dummies(df_con['Edu_new'], prefix='Edu_New')
dummy_coded_new

Unnamed: 0,Edu_New_Bachelor,Edu_New_Master or higher,Edu_New_No Bachelor
0,1,0,0
1,1,0,0
2,1,0,0
7,1,0,0
8,1,0,0
...,...,...,...
977,0,0,1
978,0,1,0
979,0,1,0
980,0,1,0


In [None]:
df_con = pd.concat([df_con, dummy_coded_new], axis=1)
df_con.head(5)

Unnamed: 0,UserID,LessonID,SumRT,Condition,Age,Sex,Ethnicity,BirthCountry,StayContry,EngYear,Howlong,Edu,formality,Test,age_group,SumRT_log,Edu_cat,Edu_Associate,Edu_Bachelor,Edu_Doctoral,Edu_High School,Edu_Master,Edu_Professional,Edu_Some college,Edu_Some high school,Edu_new,Edu_New_Bachelor,Edu_New_Master or higher,Edu_New_No Bachelor
0,FORMAL1_01,lesson10,262.812,Formal,30,2,5,India,India,15,0.0,8,0.084,PreTest,31-40,5.571439,Bachelor,0,1,0,0,0,0,0,0,Bachelor,1,0,0
1,FORMAL1_01,lesson2,122.421,Formal,30,2,5,India,India,15,0.0,8,-3.948,PostTest,31-40,4.807466,Bachelor,0,1,0,0,0,0,0,0,Bachelor,1,0,0
2,FORMAL1_01,lesson3,125.968,Formal,30,2,5,India,India,15,0.0,8,0.567,PostTest,31-40,4.836028,Bachelor,0,1,0,0,0,0,0,0,Bachelor,1,0,0
7,FORMAL1_01,lesson9,494.781,Formal,30,2,5,India,India,15,0.0,8,0.354,PreTest,31-40,6.204115,Bachelor,0,1,0,0,0,0,0,0,Bachelor,1,0,0
8,FORMAL1_04,lesson10,2686.001,Formal,27,1,1,USA,,0,0.0,8,-1.461,PreTest,21-30,7.895809,Bachelor,0,1,0,0,0,0,0,0,Bachelor,1,0,0
