# Step 0: Import Libraries and Data

Libraries always go first.

In [1]:
import numpy as np
import pandas as pd

import random

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.precision', 2)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

Before you import data from a file, it is useful to check where you are in the directory tree. You can run bash commands directly from a Jupyter notebook by putting an exclamation mark in front of them: **pwd** prints the current working directory and **ls** lists the files in it.

In [None]:
!pwd

In [None]:
!ls
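The `!` commands rely on a shell that provides `pwd` and `ls`, which may not be available on every system. As a hedged alternative, the same checks can be sketched in plain Python with the standard `pathlib` module; the variable names `cwd` and `entries` are chosen here only for illustration.

```python
from pathlib import Path

# Current working directory, the Python counterpart of `pwd`.
cwd = Path.cwd()
print(cwd)

# Names of the entries in that directory, the counterpart of `ls`.
entries = sorted(p.name for p in cwd.iterdir())
print(entries)

# Check whether the data file sits in the working directory.
print((cwd / 'kickstarter.csv').exists())
```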

If your file is in the current working directory, you can run the following command to load it into memory. Otherwise, indicate the correct path relative to your working directory, moving down the folder hierarchy like so: *'my/path/kickstarter.csv'*, or up the folder hierarchy like so: *'../kickstarter.csv'*.

In [3]:
data = pd.read_csv('kickstarter.csv')

Use the core pandas commands you learned in the previous notebook to explore the contents of **data**.

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

Now, let's get rid of 2 columns: *Unnamed: 0* and *category*. The first is just a duplicate of the index; the second could have been useful, but since the *main_category* column has fewer levels, it is more beneficial to keep that one instead. Use the **drop()** method to accomplish this task and pay attention to two of its parameters: **axis** and **inplace**.

In [None]:
#YOUR CODE GOES HERE
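As a hint for how **axis** and **inplace** behave, here is a minimal sketch on a toy data frame. The frame and its column names (`id`, `a`, `b`) are made up for illustration and are not part of the Kickstarter data.

```python
import pandas as pd

# Toy frame; the column names are invented for this example.
toy = pd.DataFrame({'id': [0, 1, 2], 'a': [10, 20, 30], 'b': ['x', 'y', 'z']})

# axis=1 tells drop() that the labels are column names, not row labels.
# Without inplace=True, drop() returns a new frame and leaves toy unchanged.
trimmed = toy.drop(['id', 'b'], axis=1)
trimmed_columns = list(trimmed.columns)
columns_before = list(toy.columns)
print(trimmed_columns, columns_before)

# With inplace=True the frame itself is modified and the call returns None.
result = toy.drop(['id'], axis=1, inplace=True)
columns_after = list(toy.columns)
print(result, columns_after)
```

Either style works; the thing to watch for is that an in-place call returns `None`, so assigning its result back to the variable would lose the data frame.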

Apply chained **isna()** and **sum()** operations to **data**. Do you see why it is beneficial to chain them? 

In [None]:
data.isna().sum()
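To see what each link of the chain contributes, the two steps can be run separately on a toy data frame. This is only a sketch; the frame and the names `flags` and `counts` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing number and two missing strings.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, None]})

# Step 1: isna() returns a frame of booleans with the same shape as toy.
flags = toy.isna()
print(flags)

# Step 2: sum() adds up each column, counting True as 1 and False as 0.
counts = flags.sum()
print(counts)

# Chaining gives the same result without the intermediate variable.
same = toy.isna().sum().equals(counts)
print(same)
```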

Use the same approach to count duplicated rows and see if there are any. The **duplicated()** method returns a boolean vector.

In [None]:
#YOUR CODE GOES HERE
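For reference, this is roughly how **duplicated()** behaves on a toy data frame (the data and variable names are invented for the example). By default a row is flagged only if an identical row appeared earlier, so the first occurrence is not counted.

```python
import pandas as pd

# Toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({'a': [1, 2, 1, 1], 'b': ['x', 'y', 'x', 'z']})

# One boolean per row; True marks a repeat of an earlier row.
dup_flags = toy.duplicated()
print(dup_flags.tolist())

# Summing the booleans counts the repeated rows.
n_dups = int(dup_flags.sum())
print(n_dups)
```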

# Step 1: Create Trouble

To make it an interesting exercise, let's punch some random, yet reproducible, holes in the data. As you know, you can ask **numpy** to do this job. For example, this is how you can pick 10 unique random numbers from a sequence of values from 0 to 1999.

In [None]:
np.random.seed(22)
np.random.choice(np.arange(2000), (10,), replace=False)

Can you explain what role **np.random.seed(22)** plays in this cell?

Type here

Now, use the code above together with **.loc** to set the *goal* variable in the corresponding rows to missing. A missing value is represented by **np.nan**.

In [None]:
#YOUR CODE GOES HERE
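The same pattern can be sketched on a toy data frame. The frame, the column name `value`, and the number of holes are assumptions made for illustration; only the seed of 22 is taken from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame with 20 rows and no missing values.
toy = pd.DataFrame({'value': np.arange(20, dtype=float)})

# Draw 3 unique row numbers from 0..19.
np.random.seed(22)
rows = np.random.choice(np.arange(len(toy)), (3,), replace=False)

# toy has the default 0..n-1 index, so these numbers are valid .loc labels.
toy.loc[rows, 'value'] = np.nan

n_missing = int(toy['value'].isna().sum())
print(n_missing)
```

Note that this relies on the default integer index. If the index had been changed, for example by filtering rows, positions and labels would no longer coincide.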

Display all rows where you set *goal* to missing. You can use **isna()** in a mask like so:

In [None]:
data[data['goal'].isna()]

Do you see missing values as expected?

In a similar fashion, punch 7 holes in the *backers* column and 2 holes in the *currency* column. Set the random seed to **22** before each draw.

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

Apply chained **isna()** and **sum()** operations to **data** to observe the result of code in this section.  

In [None]:
#YOUR CODE GOES HERE

# Step 2: Fix Things

### a) brute force elimination

Create a new clean data frame that contains the result of applying **dropna()** to **data**. Call it **data_1**.

In [None]:
data_1 = #YOUR CODE GOES HERE

Chain **isna()** and **sum()** to make sure all missing values are gone.

In [None]:
#YOUR CODE GOES HERE

What is the shape of **data_1**?

In [None]:
#YOUR CODE GOES HERE

Create another clean data frame **data_2** using a different approach. You know that missing values are in the *usd.pledged*, *goal*, *backers*, and *currency* columns. Create a mask that keeps only the rows in which none of these columns is missing. Use **notna()**. Remember, you can combine separate conditions in a mask with **&**.

In [None]:
mask = #YOUR CODE GOES HERE

In [None]:
data_2 = data[mask]

What is the shape of **data_2**? Is it the same as the shape of **data_1**?

In [None]:
#YOUR CODE GOES HERE
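The two approaches can be compared side by side on a toy data frame. This is a sketch under invented data and column names (`a`, `b`, `c`), not the Kickstarter columns.

```python
import numpy as np
import pandas as pd

# Toy frame: column a and column b each have one missing value.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': ['x', 'y', None, 'w'],
    'c': [10, 20, 30, 40],
})

# Approach 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Approach 2: keep rows where each listed column is present.
# Both conditions are boolean Series, combined element-wise with &.
mask = toy['a'].notna() & toy['b'].notna()
masked = toy[mask]

print(dropped.shape, masked.shape)
print(dropped.equals(masked))
```

The results agree here because the mask covers every column that contains missing values. If a column with missing values were left out of the mask, the two frames would differ.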

### b) filling with an estimate

Instead of just removing a bunch of rows, you can estimate and impute values. Explore the **fillna()** method. Use a measure of central tendency to impute missing values for three variables: *usd.pledged*, *goal*, and *backers*. Which measure is best? Why? Assign the resulting data frame to a new variable and call it **data_3**.

In [None]:
#YOUR CODE GOES HERE
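To help with the comparison of measures, here is a small sketch on an invented column that contains one very large value. The data and names are assumptions for illustration; the example shows the mechanics of **fillna()** with the median, which is only one of the possible choices.

```python
import numpy as np
import pandas as pd

# Toy column: three small amounts, one missing value, one very large amount.
toy = pd.DataFrame({'amount': [10.0, 12.0, 11.0, np.nan, 5000.0]})

# Both statistics skip missing values by default.
mean_value = toy['amount'].mean()
median_value = toy['amount'].median()
print(mean_value, median_value)

# Work on a copy so the original frame keeps its missing value.
filled = toy.copy()
filled['amount'] = filled['amount'].fillna(median_value)
n_missing = int(filled['amount'].isna().sum())
print(n_missing)
```

Comparing the two printed statistics with the typical values in the column is a useful input for the question above.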

Check that everything looks as expected, using one of the core functions.

In [None]:
#YOUR CODE GOES HERE

To finish cleaning **data_3** you have to impute the missing values for *currency*. This is a perfect situation to use your domain knowledge and save 2 rows of data. Look at the 2 rows that contain missing values (use **isna()** in a mask or any other method of your choice). Which column should help you make a reasonable assumption about the missing currencies?

In [None]:
#YOUR CODE GOES HERE

To make sure you use a consistent format when imputing the missing *currency* values, print out the vector of currencies for this country (use **notna()** and an equality check in a mask, or any other method of your choice).

In [None]:
#YOUR CODE GOES HERE

Assign correct currencies to the missing values. You can use **.loc** and a mask, or any other method. 

In [None]:
#YOUR CODE GOES HERE
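The **.loc** plus mask pattern for a targeted assignment can be sketched as follows. The frame, its columns (`region`, `unit`), and the values are hypothetical and deliberately unrelated to the Kickstarter columns.

```python
import pandas as pd

# Toy frame: two rows have no unit recorded.
toy = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south'],
    'unit': ['kg', 'lb', None, None],
})

# Select rows where unit is missing AND region matches a given value.
# The parentheses around the comparison are required because of operator precedence.
mask = toy['unit'].isna() & (toy['region'] == 'north')

# Assign only to the selected rows of the selected column.
toy.loc[mask, 'unit'] = 'kg'
print(toy)
```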

Check that everything looks right, i.e. there are no missing values in your data frame.

In [None]:
#YOUR CODE GOES HERE

# Step 3: Variable Types

Let's continue working with **data_3** and look into those **object** data types. Let's convert some of them into categorical variables. Read about categorical variables here https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html. Use **astype()** to convert *main_category*, *currency*, *country*, *state* to **category** data type.  

Note: it is beneficial to consider categorical variables to speed up computations and to optimize memory used to store variables. For analysis you can oftentimes use both data types. 
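The memory effect mentioned in the note can be observed on a toy Series with a few repeated labels. This is a rough sketch; the data is invented, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# 30,000 strings, but only 3 distinct values.
labels = pd.Series(['red', 'green', 'blue'] * 10_000)

# astype('category') stores each distinct value once plus a small integer code per row.
as_category = labels.astype('category')

# deep=True asks pandas to include the memory held by the strings themselves.
bytes_plain = labels.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(bytes_plain, bytes_category)

# The distinct values are available through the .cat accessor.
categories = as_category.cat.categories.tolist()
print(categories)
```

The saving grows with the number of rows per distinct value, so columns with few levels benefit most.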

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

Check that everything looks right in terms of data types and missing values with one of the core commands.

In [None]:
#YOUR CODE GOES HERE

You took care of the categorical variables. However, there is another set of strings that are not useful in their current format: dates and times. Pandas has a native **to_datetime()** function; use it to fix the variable types of *deadline* and *launched*.

**Note**: the function can usually infer the format of a date and time string on its own. That said, when working with larger volumes of data it is good practice to pass the correct format through the **format** argument. It speeds up parsing and gives you more control, since data formats might be inconsistent.
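A minimal sketch of passing an explicit format is shown here. The sample strings and the format `'%Y-%m-%d %H:%M:%S'` are assumptions for the example; check how the dates in your own columns are actually written before reusing it.

```python
import pandas as pd

# Invented timestamps written as year-month-day hour:minute:second.
stamps = pd.Series(['2015-10-09 11:36:00', '2013-01-12 00:20:50'])

# An explicit format skips format inference and rejects strings that do not match.
parsed = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S')
print(parsed.dtype)
print(parsed.dt.year.tolist())

# errors='coerce' turns unparseable entries into NaT instead of raising an error.
messy = pd.Series(['2015-10-09 11:36:00', 'not a date'])
coerced = pd.to_datetime(messy, format='%Y-%m-%d %H:%M:%S', errors='coerce')
print(coerced.isna().tolist())
```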

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

Once the dates are converted into proper **datetime** types, you can do arithmetic on them. Let's build on that and engineer a new feature: a new column of the data frame that contains the duration of each campaign. You can call it **campaign_duration**.

In [None]:
data_3['campaign_duration'] = #YOUR CODE GOES HERE

Have a look at the first couple of rows of the data frame. Do you see the new feature you just engineered?

In [None]:
#YOUR CODE GOES HERE

This looks nice. To be able to perform simple computations and to be independent of special variable types, let's extract ***days*** from the new column and make sure the result is of type ***integer***. Check out **dt.days** in the pandas documentation. Feel free to chain operations or do it in 2 separate lines of code. The output you want is a simple column of integers giving the number of days of each campaign in the **campaign_duration** column.

In [None]:
#YOUR CODE GOES HERE
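Date subtraction and **dt.days** can be illustrated on a toy data frame. The column names (`start`, `end`, `span`, `span_days`) and the dates are made up for this sketch.

```python
import pandas as pd

# Toy frame with two datetime columns.
toy = pd.DataFrame({
    'start': pd.to_datetime(['2020-01-01 08:00', '2020-03-10 12:00']),
    'end': pd.to_datetime(['2020-01-31 20:00', '2020-03-15 06:00']),
})

# Subtracting two datetime columns gives a timedelta column.
toy['span'] = toy['end'] - toy['start']
print(toy['span'].dtype)

# dt.days keeps only the whole-day part of each timedelta.
toy['span_days'] = toy['span'].dt.days.astype(int)
print(toy['span_days'].tolist())
```

Keep in mind that **dt.days** drops the hours: a span of 30 days and 12 hours becomes 30.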

Check that everything looks right in terms of data types and missing values with one of the core commands.

In [None]:
#YOUR CODE GOES HERE

Finally, you can store the resulting data frame in a .csv file and use this file for data exploration.

In [None]:
data_3.to_csv("kickstarter_clean.csv")
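One detail worth knowing about **to_csv()**: by default it also writes the index, which shows up as an *Unnamed: 0* column when the file is read back. The sketch below demonstrates this on a toy data frame, writing to in-memory buffers instead of files; the frame and variable names are assumptions for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)
```

Also note that a .csv file is plain text, so **category** and **datetime** types are not stored in it and have to be restored after reading the file again.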

Using the bash command ***ls*** you can check whether it worked, i.e. whether the file with clean data is in your working directory. Do not forget to put an exclamation mark before bash commands when calling them from Jupyter notebooks.

In [None]:
!ls

Congratulations! You've cleaned the data!