## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct ones. Follow the instructions for each Task carefully.

### Instructions

* **Download this notebook** as you would any other ipynb file
* **Upload** to Google Colab or work locally (if you have that set-up)
* **Delete `raise NotImplementedError()`**
* Write your code in the `# YOUR CODE HERE` space
* **Execute** the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
* **Save** your notebook when you are finished
* **Download** as an `ipynb` file (if working in Colab)
* **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)

# Lambda School Data Science - Unit 1 Sprint 1 Module 2

## Making Features

### Module Learning Objectives

- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas
- Modify or create columns of a dataframe using the `.apply()` function

### Resources

- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations - e.g., `.strip()`, `.split()`, `.replace()`, and list comprehensions
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series - many more details about handling time with and without pandas
- [Lambda Learning Method for DS - By Ryan Herr](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit?usp=sharing)

### Notebook points: 11

### Introduction

This project focuses on learning how to create new features from existing data. There are many ways to create features in your datasets; we'll only focus on a few here. Let's get started!

**Task 1** - Load the data set

* Using the provided URL, load the Ames Housing Data into a DataFrame called `house`
* View your DataFrame

In [270]:
# Task 1

# Imports
import pandas as pd

# Dataset URL

data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Housing/Ames%20Iowa%20Housing%20Data.csv'

# YOUR CODE HERE
house = pd.read_csv(data_url)

# View your DataFrame
house.head()
house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

**Task 1 Test**

In [271]:
# Task 1 - Test

# These tests are for you to check your work before submitting
assert isinstance(house, pd.DataFrame), 'Have you created a DataFrame named house?'

# Hidden tests - you'll see results when you submit to Canvas

**Task 2** - Create a new feature

Now that we have the data loaded, we're going to create a new feature by combining some of the existing features: in this case, the total number of bathrooms collected in one column.

* Create a new variable called `Total_Bathrooms` that contains the total number of full and half bathrooms in the house.  

*Hint: Identify all the columns with bath in the title and add their values together. There are four columns to add together.  Add one to the total for each full or half bathroom.*
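
The hint above can be sketched on a small stand-in DataFrame. This is only an illustration: the `toy` DataFrame and its values are made up, and matching on the substring `'Bath'` is an assumption about how the column names are spelled.

```python
import pandas as pd

# Hypothetical stand-in for the housing data (values invented for illustration)
toy = pd.DataFrame({
    'BsmtFullBath': [1, 0],
    'BsmtHalfBath': [0, 1],
    'FullBath': [2, 2],
    'HalfBath': [1, 0],
    'LotArea': [8450, 9600],
})

# Collect every column whose name contains "Bath"
bath_cols = [col for col in toy.columns if 'Bath' in col]

# Add the values across those columns, one total per row
toy['Total_Bathrooms'] = toy[bath_cols].sum(axis=1)

print(toy[bath_cols + ['Total_Bathrooms']])
```

Note that `bath_cols` is built before the new column exists. If the cell were re-run, `Total_Bathrooms` itself would match `'Bath'` and be included in the sum, so listing the four columns explicitly is the safer choice in the graded notebook.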

In [272]:
# Task 2

# YOUR CODE HERE
house['Total_Bathrooms'] = house['BsmtFullBath'] + house['BsmtHalfBath'] + house['FullBath'] + house['HalfBath']

**Task 2 Test**

In [273]:
# Task 2 - Test

assert 'Total_Bathrooms' in house.columns, 'Did you add the new column?'

# Hidden tests - you'll see results when you submit to Canvas

# View all the bathroom columns
house[['Total_Bathrooms','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath']].head()

| | Total_Bathrooms | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath |
|---|---|---|---|---|---|
| 0 | 4 | 1 | 0 | 2 | 1 |
| 1 | 3 | 0 | 1 | 2 | 0 |
| 2 | 4 | 1 | 0 | 2 | 1 |
| 3 | 2 | 1 | 0 | 1 | 0 |
| 4 | 4 | 1 | 0 | 2 | 1 |


**Task 3** - Create a new feature using encoding

Next we're going to create another new column by "encoding" an existing column. Using the `LotArea` column, we'll assign values according to the area of the lot.

* Create a new column called `Large_Lot` that takes on the following values:
    * 0 when `LotArea` is **less** than 10,000
    * 1 when `LotArea` is **greater than or equal** to 10,000
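
One possible way to express this rule is a single vectorized comparison. The sketch below uses a made-up `toy` DataFrame, not the real housing data; a boolean mask cast to integers yields the 0/1 encoding directly.

```python
import pandas as pd

# Hypothetical lot areas, including the boundary value 10,000
toy = pd.DataFrame({'LotArea': [8450, 9600, 11250, 10000]})

# The comparison yields True/False; casting to int turns that into 1/0
toy['Large_Lot'] = (toy['LotArea'] >= 10000).astype(int)

print(toy)
```

Because the mask is computed once from the original `LotArea` values, there is no risk of overwriting a value and then testing the overwritten value again.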

In [274]:
# Task 3

# YOUR CODE HERE
large_lot = house['LotArea'].copy()

greater = large_lot >= 10000
less = large_lot < 10000
large_lot[greater] = 1
large_lot[less] = 0

house['Large_Lot'] = large_lot
# View the new column
house[['LotArea', 'Large_Lot']].head()


| | LotArea | Large_Lot |
|---|---|---|
| 0 | 8450 | 0 |
| 1 | 9600 | 0 |
| 2 | 11250 | 1 |
| 3 | 9550 | 0 |
| 4 | 14260 | 1 |


**Task 3 Test**

In [275]:
# Task 3 - Test

assert 'Large_Lot' in house.columns, 'Did you add the Large_Lot column?'

# Hidden tests - you'll see results when you submit to Canvas

**Task 4** - Create another feature

Let's continue with creating new features. This time, we'll focus on encoding for the sale month.

* Create a new column called `Summer_Sale` that takes on the following values:
    * 0 if the sale month was in September - May (`MoSold` = 1-5, 9-12)
    * 1 if the sale month was in June, July or August (`MoSold` = 6, 7 or 8)
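
As a minimal sketch of this encoding, `.isin()` can test membership in the summer months and the resulting boolean mask can be cast to 0/1. The `toy` DataFrame and its month values are made up for illustration.

```python
import pandas as pd

# Hypothetical sale months (1 = January, ..., 12 = December)
toy = pd.DataFrame({'MoSold': [2, 6, 8, 9, 12]})

# True for June, July, and August; False for every other month
is_summer = toy['MoSold'].isin([6, 7, 8])

# Cast the boolean mask to integers to get the 0/1 encoding
toy['Summer_Sale'] = is_summer.astype(int)

print(toy)
```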

In [276]:
# Task 4

# YOUR CODE HERE
summer_sale = house['MoSold'].copy()
sep_may = summer_sale.isin([1, 2, 3, 4, 5, 9, 10, 11, 12])
jun_aug = summer_sale.isin([6, 7, 8])
summer_sale[sep_may] = 0
summer_sale[jun_aug] = 1

house['Summer_Sale'] = summer_sale
# View the new column
house[['MoSold', 'Summer_Sale']]


| | MoSold | Summer_Sale |
|---|---|---|
| 0 | 2 | 0 |
| 1 | 5 | 0 |
| 2 | 9 | 0 |
| 3 | 2 | 0 |
| 4 | 12 | 0 |
| ... | ... | ... |
| 1455 | 8 | 1 |
| 1456 | 2 | 0 |
| 1457 | 5 | 0 |
| 1458 | 4 | 0 |


**Task 4 Test**

In [277]:
# Task 4 - Test

assert 'Summer_Sale' in house.columns, 'Did you add the Summer_Sale column?'

# Hidden tests - you'll see results when you submit to Canvas

**Task 5**

Now we'll revisit the LendingClub data. The statements to import the `loans` DataFrame have been provided for you.  Run the code block below without changing anything to load the `loans` data.

* Load the data into a DataFrame called `loans`.
* Make sure to view the `loans` DataFrame to see what data we're working with.

In [278]:
# Task 5

# Dataset URL

loans_data = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/LendingClub/LoanStats_2018Q4_sm.csv'

# YOUR CODE HERE
loans = pd.read_csv(loans_data)

# View the dataset!
loans.head()

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 5525 | 5525 | 5525.0 | 36 months | 10.72% | 180.15 | B | B2 | ... | | | | N | | | | | | |
| 1 | | | 10000 | 10000 | 10000.0 | 36 months | 10.08% | 323.05 | B | B1 | ... | | | | N | | | | | | |
| 2 | | | 12000 | 12000 | 12000.0 | 60 months | 10.08% | 255.44 | B | B1 | ... | | | | N | | | | | | |
| 3 | | | 20000 | 20000 | 20000.0 | 36 months | 6.46% | 612.62 | A | A1 | ... | | | | N | | | | | | |
| 4 | | | 12000 | 12000 | 12000.0 | 36 months | 7.02% | 370.64 | A | A2 | ... | | | | N | | | | | | |


**Task 5 Test**

In [279]:
# Task 5 - Test

assert isinstance(loans, pd.DataFrame), 'Have you created a DataFrame named loans?'
assert loans.shape == (30000, 144), 'Double check your DataFrame size.'

# NO hidden tests for this task

**Introduction** - Approaching a problem two different ways.

In the Guided Project, we learned how to find the earliest credit year using the built-in date-time format.  However, there is often more than one way to come up with the same solution.  

In the following tasks, we will work through the steps to create new variables called `Earliest_Credit_Year` and `Issue_Year`. We use the `.split()` method to split a string variable into its month and year parts and return just the year value. We can then use those variables to calculate the length of credit history in years.

If we do everything correctly, we should calculate *about* the same length of longest credit history that we did in class working with the date-time format.  The answers won't exactly match because we are making different assumptions about when in each month and year the loans were taken out, but it will give you a flavor for different ways of approaching the same problem.
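
To give a feel for why the two answers differ slightly, here is a rough comparison for a single pair of dates. It assumes the date-time approach parses each string with `pd.to_datetime` and so treats each date as the first day of its month; the Guided Project code is not reproduced here and may differ. The variable names are hypothetical.

```python
import pandas as pd

# One example pair of month-year strings
earliest = 'Oct-1998'
issued = 'Dec-2018'

# Approach 1: split each string on '-' and subtract the year parts only
years_split = float(issued.split('-')[1]) - float(earliest.split('-')[1])

# Approach 2: parse to timestamps (day defaults to the 1st) and subtract
delta = (pd.to_datetime(issued, format='%b-%Y')
         - pd.to_datetime(earliest, format='%b-%Y'))
years_datetime = delta.days / 365.25

print('Split approach:    ', years_split)
print('Date-time approach:', round(years_datetime, 2))
```

The split approach ignores the months entirely, so it reports a whole number of years, while the date-time approach also counts the two extra months between October and December.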

**Task 6** - Create a simple test case

The `credit` variable has been created for you with the value 'Jun-1979'

* Use the `.split('-')` function to separate the month and year parts of the data.
* Name the result of the `.split('-')` function `fields`:
    * assign `fields[0]` to the variable `month`
    * assign `fields[1]` to the variable `year`

In [280]:
# Task 6

# Don't change or delete
credit = 'Jun-1979' 

# YOUR CODE HERE
fields = credit.split('-')
month = fields[0]
year = fields[1]
# Look at the year variable both as a string and as a float
print('String: ', year)
print('Float: ', float(year))


String:  1979
Float:  1979.0


**Task 6 Test**

In [281]:
# Task 6 - Test

# Hidden tests - you'll see results when you submit to Canvas

**Task 7** - Create a function

Use your answer from **Task 6** to write a function called `credit_yr` that takes in the contents of a cell that is formatted as "month-year" and returns the year.

* Complete the definition of the function `credit_yr` below:
    * one argument `cell_contents` as input (provided for you)
    * splits `cell_contents` on a `-` and calls the result `fields`
    * creates the variable `year` which has the value `fields[1]`
    * returns `year` as a float variable
* Run your function using the simple test case in Task 6 (`credit='Jun-1979'`). Assign the output of the function to the variable `year_function`.

In [282]:
# Task 7

# Complete the function definition
def credit_yr(cell_contents):
    fields = cell_contents.split('-')
    year = fields[1]
    return float(year)


# Assign the function output, then print your year_function variable
year_function = credit_yr('Jun-1979')
print(year_function)


**Task 7 Test**

In [283]:
# Task 7 - Test

assert isinstance(year_function, float), 'Make sure your year variable is a float!'

# Hidden tests - you'll see results when you submit to Canvas

**Task 8** - Apply our function

Now we're going to use the function we created and apply it to a column in our `loans` DataFrame. 

* Use the `.apply()` function to apply the function to every cell in the `earliest_cr_line` column.
* Assign the results to a new column called `Earliest_Credit_Year`.
* View the top five rows of the `Earliest_Credit_Year` and `earliest_cr_line` columns to make sure the variables were created correctly (code provided for you).
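
The sketch below shows `.apply()` on a made-up two-row DataFrame, next to a vectorized alternative that uses the pandas `.str` accessor. The `toy` data is hypothetical, and `credit_yr` is redefined here only so the example runs on its own.

```python
import pandas as pd

def credit_yr(cell_contents):
    """Return the year part of a 'Mon-YYYY' string as a float."""
    fields = cell_contents.split('-')
    return float(fields[1])

# Hypothetical stand-in for the loans data
toy = pd.DataFrame({'earliest_cr_line': ['Oct-1998', 'Sep-2015']})

# .apply() calls credit_yr once for every cell in the column
via_apply = toy['earliest_cr_line'].apply(credit_yr)

# Vectorized alternative: split every string, take element 1, cast to float
via_str = toy['earliest_cr_line'].str.split('-').str[1].astype(float)

print(via_apply.equals(via_str))
```

Both routes give the same column here. The task asks for `.apply()`, which also generalizes to functions that have no `.str` equivalent.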

In [284]:
# Task 8

loans = loans.copy()

# YOUR CODE HERE
loans["Earliest_Credit_Year"] = loans['earliest_cr_line'].apply(credit_yr)

# Print your columns
loans[['Earliest_Credit_Year','earliest_cr_line']].head()


| | Earliest_Credit_Year | earliest_cr_line |
|---|---|---|
| 0 | 1998.0 | Oct-1998 |
| 1 | 2015.0 | Sep-2015 |
| 2 | 2003.0 | Jun-2003 |
| 3 | 2005.0 | Feb-2005 |
| 4 | 2008.0 | Feb-2008 |


**Task 8 Test**

In [285]:
# Task 8 - Test

assert 'Earliest_Credit_Year' in loans.columns, 'Did you add the "Earliest_Credit_Year" column?'
assert loans.shape == (30000, 145), 'Double check your DataFrame size.'

# Hidden tests - you'll see results when you submit to Canvas

**Task 9** - Apply the function to a new column

Next, we're going to use the `credit_yr` function to create a new variable called `Issue_Year` from the `issue_d` variable. Follow the same steps as you did in Task 8.

* Apply `credit_yr` to the `issue_d` column.
* Name your new column `Issue_Year`.
* View the `Issue_Year`,`issue_d` columns (code provided for you).

In [287]:
# Task 9

# YOUR CODE HERE
Issue_Year = loans['issue_d'].apply(credit_yr)
loans['Issue_Year'] = Issue_Year
# Print your columns
loans[['Issue_Year', 'issue_d']].head()


| | Issue_Year | issue_d |
|---|---|---|
| 0 | 2018.0 | Dec-2018 |
| 1 | 2018.0 | Oct-2018 |
| 2 | 2018.0 | Oct-2018 |
| 3 | 2018.0 | Nov-2018 |
| 4 | 2018.0 | Dec-2018 |


**Task 9 Test**

In [288]:
# Task 9 - Test

assert 'Issue_Year' in loans.columns, 'Did you add the "Issue_Year" column?'
assert loans.shape == (30000, 146), 'Double check your DataFrame size.'

# Hidden tests - you'll see results when you submit to Canvas

**Task 10** - Calculate length of time from your new columns

Now we're going to use the two new columns you created to calculate the length of credit history in years and in days.

* Create a new column called `Credit_History_Years` which is the difference between `Issue_Year` and `Earliest_Credit_Year`
* Create another new column called `Credit_History_Days` which is the value in the column you created above, except multiplied by 365.25 to convert years to days.
* View the top five rows of the `loans` DataFrame (code has been provided).

Hint: If you are getting an error message, make sure to check the data type of the `year` values output by your function.
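
The arithmetic can be sketched on a made-up two-row DataFrame. It assumes both year columns already hold floats, as returned by `credit_yr`; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical year columns, already stored as floats
toy = pd.DataFrame({
    'Issue_Year': [2018.0, 2018.0],
    'Earliest_Credit_Year': [1998.0, 2015.0],
})

# Length of credit history in years: issue year minus earliest credit year
toy['Credit_History_Years'] = toy['Issue_Year'] - toy['Earliest_Credit_Year']

# Convert years to days, using 365.25 days per year to allow for leap years
toy['Credit_History_Days'] = toy['Credit_History_Years'] * 365.25

print(toy)
```

If the year columns held strings instead of floats, the subtraction would raise a `TypeError`, which is the situation the hint above describes.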

In [None]:
# Task 10

# YOUR CODE HERE
raise NotImplementedError()

loans[['Issue_Year','Earliest_Credit_Year','Credit_History_Years','Credit_History_Days']].head()

**Task 10 Test**

In [None]:
# Task 10 - Test

assert 'Credit_History_Years' in loans.columns, 'Did you add the correct column?'
assert 'Credit_History_Days' in loans.columns, 'Did you add the correct column converted to days?'
assert loans.shape == (30000, 148), 'Double check your DataFrame size.'

# Hidden tests - you'll see results when you submit to Canvas

**Task 11** - Calculate the maximum credit length

Finally, we're going to take the result from the `Credit_History_Years` column and compare it to what we got in the Guided Project.

* Find the maximum value in the `Credit_History_Years` column and assign it to the variable `max_credit`. Make sure you have a float value defined out to one decimal place.
* Compare it to the value from the Guided Project (refer to your notebook from class).

Note - it won't be *exactly* the same because both methods are working in different ways, but they will give you a "pretty close" answer.
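
A minimal sketch of this step uses a made-up Series in place of the real `Credit_History_Years` column. Wrapping the result in `float()` converts the NumPy scalar returned by `.max()` to a plain Python float before rounding.

```python
import pandas as pd

# Hypothetical credit history lengths in years
years = pd.Series([20.0, 3.0, 41.0])

# Take the largest value, convert to a Python float, round to one decimal
max_credit = round(float(years.max()), 1)

print(max_credit)
```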

In [None]:
# Task 11

# YOUR CODE HERE
raise NotImplementedError()

**Task 11 Test**

In [None]:
# Task 11 - Test

# Hidden tests - you'll see results when you submit to Canvas
