# Exploratory Data Analysis in Python 

In [2]:
import numpy as np  
import pandas as pd 

from matplotlib import pyplot as plt 
import seaborn as sns #another graphical library - works in tandem with matplotlib 
sns.set() #more stylish graphs

%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

We are going to work with Kickstarter Data. Download your clean dataset here. 

In [3]:
data = pd.read_csv('/Users/nick/github/hugo/python basics/data_cleaning/kickstarter_clean.csv')

### Initial glance at the data

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         2000 non-null   int64  
 1   Unnamed: 0.1       2000 non-null   int64  
 2   ID                 2000 non-null   int64  
 3   name               2000 non-null   object 
 4   category           2000 non-null   object 
 5   main_category      2000 non-null   object 
 6   currency           2000 non-null   object 
 7   deadline           2000 non-null   object 
 8   goal               2000 non-null   float64
 9   launched           2000 non-null   object 
 10  pledged            2000 non-null   float64
 11  state              2000 non-null   object 
 12  backers            2000 non-null   float64
 13  country            2000 non-null   object 
 14  usd.pledged        2000 non-null   float64
 15  campaign_duration  2000 non-null   int64  
dtypes: float64(4), int64(4),

In [5]:
data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd.pledged,campaign_duration
0,0,1,239894312,Barwagen,Woodworking,Crafts,CHF,2017-11-30 00:00:00,500.0,2017-10-23 19:32:00,15.0,failed,2.0,CH,0.0,37
1,1,2,353169821,StrongMACH Customs muscle car replicas,Art,Art,USD,2016-10-03 00:00:00,100000.0,2016-08-04 18:29:00,0.0,failed,0.0,US,0.0,59
2,2,3,1387928487,Cravin' Dogs 30th anniversary CD project,Rock,Music,USD,2016-10-27 00:00:00,12500.0,2016-09-27 16:40:00,1180.0,failed,18.0,US,531.0,29
3,3,4,735160267,A Socially Awkward fundraiser!,Plays,Theater,GBP,2017-09-14 00:00:00,1500.0,2017-07-16 16:33:00,1500.0,successful,42.0,GB,144.05,59
4,4,5,1838469271,MechRunner,Video Games,Games,USD,2014-05-17 00:00:00,25000.0,2014-04-13 20:57:00,28434.0,successful,574.0,US,28434.0,33


In [None]:
# YOUR CODE GOES HERE

In [None]:
# YOUR CODE GOES HERE

### Summary statistics

In [6]:
data['goal'].mean()

23597.627693651593

In [7]:
data.goal.mean() # alternative attribute syntax; does not always work (e.g. column names containing spaces or dots, such as 'usd.pledged')

23597.627693651593

In [8]:
data['goal'].median()

5000.0

#### Summary statistics of GOAL

a) mean and median

*Note*: try to remove **.T** to see what **T** is doing to the data frame.

In [11]:
pd.DataFrame([data['goal'].mean(), 
              data['goal'].median()], 
              index = ['mean', 'median'], 
              columns = ['goal']).T

Unnamed: 0,mean,median
goal,23597.627694,5000.0


In [12]:
pd.DataFrame([data['goal'].mean(), 
              data['goal'].median()], 
              index = ['mean', 'median'], 
              columns = ['goal'])

Unnamed: 0,goal
mean,23597.627694
median,5000.0


b) minimum, maximum and range

Using the code above as an example, do it yourself.

In [13]:
pd.DataFrame([data['goal'].min(), 
              data['goal'].max()], 
              index = ['min', 'max'], 
              columns = ['goal']).T

Unnamed: 0,min,max
goal,1.0,2000000.0


c) variance and standard deviation

Using the code above as an example, do it yourself.

*Note*: for standard deviation, include 2 columns: one should use the built-in **std()** function and the other should calculate the standard deviation from the variance. 

In [14]:
pd.DataFrame([data['goal'].std(), 
              data['goal'].var()], 
              index = ['standard deviation', 'variance'], 
              columns = ['goal']).T

Unnamed: 0,standard deviation,variance
goal,99170.069352,9834703000.0
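
The note above asks for the standard deviation computed two ways. A minimal sketch on made-up numbers (the values and the names `toy_goal`, `std_builtin`, `std_from_var`, and `spread` are assumptions for illustration, not part of the Kickstarter data) shows that the square root of the variance matches **std()**, because both default to the sample formula in pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the 'goal' column
toy_goal = pd.Series([500.0, 1500.0, 5000.0, 12500.0, 100000.0], name="goal")

std_builtin = toy_goal.std()            # sample standard deviation (ddof=1 by default)
std_from_var = np.sqrt(toy_goal.var())  # var() also defaults to ddof=1

# One row, three columns, in the same layout as the tables above
spread = pd.DataFrame(
    [std_builtin, std_from_var, toy_goal.var()],
    index=["std (built-in)", "std (from variance)", "variance"],
    columns=["goal"],
).T
print(spread)
```

The same pattern should carry over to `data['goal']` unchanged.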


d) print out quantiles 1 and 3, as well as IQR.

*Note*: explore **percentile()** function in **numpy** library.

In [19]:
Q1 = np.percentile(data['goal'], 25)  # first quartile
Q3 = np.percentile(data['goal'], 75)  # third quartile

IQR = Q3 - Q1

print('Q1:', Q1)
print('Q3:', Q3)
print('IQR:', IQR)

Q1: 2000.0
Q3: 17000.0
IQR: 15000.0


Assemble all the statistics you calculated above into one summary table. 

In [24]:
SummaryData = {'Q1':[Q1], 'Q3':[Q3], 'IQR':[IQR]}
summary = pd.DataFrame(SummaryData, columns = ['Q1', 'Q3', 'IQR'])
summary

Unnamed: 0,Q1,Q3,IQR
0,2000.0,17000.0,15000.0


Compare to a shortcut implementation.

In [25]:
data.goal.describe()

count    2.000000e+03
mean     2.359763e+04
std      9.917007e+04
min      1.000000e+00
25%      2.000000e+03
50%      5.000000e+03
75%      1.700000e+04
max      2.000000e+06
Name: goal, dtype: float64

#### Summary statistics of MAIN_CATEGORIES

Let's look at the categories. This is a simple example in which we just count the records in our data frame by category. 

In [None]:
data.main_category.value_counts()

Alternatively, you can get the same information with the **groupby** method. **groupby** is one of the most important tools in pandas because it generalizes our approach to creating statistical summaries: it splits the data into smaller chunks, runs a separate computation on each chunk, and combines the results.
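
To make the split, apply, and combine steps concrete, here is a small sketch on a made-up data frame (the rows and the names `toy`, `chunks`, `manual`, and `combined` are assumptions for illustration). It performs the three steps by hand and then with a single **groupby** call:

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the Kickstarter frame
toy = pd.DataFrame({
    "main_category": ["Art", "Music", "Art", "Games", "Music", "Art"],
    "goal": [1000.0, 5000.0, 3000.0, 25000.0, 7000.0, 2000.0],
})

# Split: one chunk of rows per category
chunks = {name: chunk for name, chunk in toy.groupby("main_category")}

# Apply: compute a statistic on each chunk separately
manual = {name: chunk["goal"].mean() for name, chunk in chunks.items()}

# Combine: groupby performs all three steps in one call
combined = toy.groupby("main_category")["goal"].mean()

print(manual)
print(combined)
```

Both routes give the same per-category means; the one-line version is simply less code.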

In [None]:
data.groupby(['main_category'])['ID'].agg(['count']).sort_values(by='count', ascending=False)

Can you tell why the output is in a slightly different format?

In [None]:
Type here

Using the example code above add the second column to the summary that contains the mean goal by main category. As an argument of the **agg()** function make sure you use a dictionary instead of a list. The key of the dictionary is a name of the column. The value is the summary statistic you want. Both keys and values should be strings. Lastly, sort values by goal.    
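
As a hint for the dictionary form of **agg()**, the sketch below runs on a made-up data frame (the rows and the names `toy` and `by_category` are assumptions for illustration). Each key names a column and each value names the statistic to compute for it:

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the Kickstarter frame
toy = pd.DataFrame({
    "main_category": ["Art", "Music", "Art", "Games", "Music", "Art"],
    "ID": [11, 12, 13, 14, 15, 16],
    "goal": [1000.0, 5000.0, 3000.0, 25000.0, 7000.0, 2000.0],
})

# Dictionary form: key = column to summarize, value = name of the statistic
by_category = (
    toy.groupby("main_category")
       .agg({"ID": "count", "goal": "mean"})
       .sort_values(by="goal", ascending=False)
)
print(by_category)
```

Unlike the list form, the dictionary lets each column get its own statistic.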

In [None]:
# YOUR CODE GOES HERE

Now let's look into the *state* variable. See the counts of observations with different states. 

In [None]:
# YOUR CODE GOES HERE

Let's filter only *failed* and *successful* categories with a mask. 
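
If masks are new to you, the following sketch on made-up rows may help (the state values `canceled` and `live` and the names `toy`, `mask`, and `filtered` are assumptions for illustration). A mask is a boolean Series, and indexing a data frame with it keeps only the rows marked `True`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "state": ["failed", "successful", "canceled", "live", "successful"],
    "goal": [500.0, 1500.0, 3000.0, 800.0, 25000.0],
})

# True where the row should be kept, False otherwise
mask = toy["state"].isin(["failed", "successful"])

filtered = toy[mask]
print(mask.tolist())
print(filtered)
```

The same mask could also be written with two comparisons joined by `|`; **isin()** is just shorter when there are several values to keep.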

In [None]:
# YOUR CODE GOES HERE

In [None]:
# YOUR CODE GOES HERE

Now let's see the same statistics, the count of observations and the mean of *goal*, by *state*. Chain the **rename()** and **reset_index()** functions to your code to get a neater table. 
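
One possible shape for such a chain, sketched on made-up rows (the data and the new column names `n_projects` and `mean_goal` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "state": ["failed", "successful", "failed", "successful", "failed"],
    "ID": [1, 2, 3, 4, 5],
    "goal": [500.0, 1500.0, 3000.0, 800.0, 2500.0],
})

by_state = (
    toy.groupby("state")
       .agg({"ID": "count", "goal": "mean"})
       .rename(columns={"ID": "n_projects", "goal": "mean_goal"})  # clearer column labels
       .reset_index()  # turn the 'state' index back into a regular column
)
print(by_state)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer pipelines readable.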

In [None]:
# YOUR CODE GOES HERE

Let's see the attributes of successful projects. Group only the successful projects by main category and observe the same statistics there. 

In [None]:
# YOUR CODE GOES HERE

In [None]:
# YOUR CODE GOES HERE

### Visualizations

Create a boxplot of *goal*.

In [None]:
# YOUR CODE GOES HERE

This is how you can create subplots of *goal*, *pledged*, and *backers*. Study the code below: it produces several plots in a row. 

In [None]:
# One row of three axes; each subplot keeps its own x and y scale
fig, ax = plt.subplots(1, 3, sharex=False, sharey=False, figsize=[15,7])
plt.suptitle('Boxplots with Outliers', fontsize=18)

# Column 1
ax[0].boxplot(data.goal)
ax[0].set_title('Goal', loc='center', fontsize=14)
ax[0].set_xticklabels([])  # hide the default tick label

# Column 2
ax[1].boxplot(data.pledged)
ax[1].set_title('Pledged', loc='center', fontsize=14)
ax[1].set_xticklabels([])

# Column 3
ax[2].boxplot(data.backers)
ax[2].set_title('Backers', loc='center', fontsize=14)
ax[2].set_xticklabels([]);

Use a mask to filter out outliers and see how the boxplots change. 
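
The exercise does not prescribe a particular definition of an outlier. One common choice, assumed here, is the 1.5 × IQR rule that boxplot whiskers use by default. The sketch below applies it to made-up numbers (the data and the names `toy`, `toy_mask`, and `toy_no_outliers` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one extreme value
toy = pd.DataFrame({"goal": [500.0, 1000.0, 2000.0, 5000.0, 8000.0, 2000000.0]})

q25, q75 = np.percentile(toy["goal"], [25, 75])
iqr = q75 - q25

# Assumed rule: anything beyond 1.5 * IQR from the quartiles counts as an outlier
lower_fence = q25 - 1.5 * iqr
upper_fence = q75 + 1.5 * iqr

toy_mask = (toy["goal"] >= lower_fence) & (toy["goal"] <= upper_fence)
toy_no_outliers = toy[toy_mask]
print(toy_no_outliers)
```

Other cut-offs, such as a fixed percentile, are equally valid; what matters is that the mask states the rule explicitly.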

In [None]:
mask = # YOUR CODE GOES HERE

In [None]:
new_data = # YOUR CODE GOES HERE

In [None]:
# YOUR CODE GOES HERE

Similarly, build three subplots with **histograms** of *goal*, *pledged*, and *backers*.

In [None]:
# YOUR CODE GOES HERE

Now, build 2 bar charts of *main_category* counts and *state* counts as subplots. Which looks better in this case: a vertical or a horizontal implementation? 

In [None]:
# YOUR CODE GOES HERE

In [None]:
# YOUR CODE GOES HERE

In [None]:
# YOUR CODE GOES HERE

Let's look at the relationship between *pledged* and *backers*. It is also interesting to color the points by state, either *successful* or *failed*. The **scatterplot()** function from the **seaborn** package is a convenient way to combine these three variables in one graph. Check it out.

Also see what **plt.figure(figsize=[10,10])** is doing: try running your code with and without it. It is important to notice that the **matplotlib** library works together with the **seaborn** library to create a single graph.
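
A rough sketch of how the two libraries cooperate, using made-up rows (the data and the names `toy`, `fig`, and `ax` are assumptions for illustration): **matplotlib** creates the figure and fixes its size, and **seaborn** then draws onto that figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook; omit in Jupyter
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data
toy = pd.DataFrame({
    "backers": [2, 18, 42, 574, 0, 120],
    "pledged": [15.0, 1180.0, 1500.0, 28434.0, 0.0, 6000.0],
    "state": ["failed", "failed", "successful", "successful", "failed", "successful"],
})

fig = plt.figure(figsize=[10, 10])  # matplotlib creates the canvas and sets its size
ax = sns.scatterplot(data=toy, x="backers", y="pledged", hue="state")  # seaborn draws on it
ax.set_title("Pledged vs. backers (toy data)")
```

Because **scatterplot()** draws on the current figure, leaving out the **plt.figure()** line still produces a plot, only at the default size.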

In [None]:
plt.figure(figsize=[10,10])
# YOUR CODE GOES HERE

You can see the general trend, but because most of the points are densely packed in the bottom-left corner, the picture is hard to read. A log transformation spreads those points out. To transform the data, the code below uses a **lambda** function: an anonymous function written inline, without a name, which makes it convenient for one-off use (as opposed to the named function that we defined in the "area code" exercise). Here the lambda says 'replace every x with the log of x', and the **apply()** function runs it on every value in the *backers* and *pledged* columns.

In [None]:
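# Keep only finished campaigns; .copy() gives an independent frame that is safe to modify
new_data_log = new_data[(new_data['state'] == 'successful') | (new_data['state'] == 'failed')].copy()

# Note: np.log(0) is -inf, so rows with zero backers or zero pledged become -inf
new_data_log['backers'] = new_data_log['backers'].apply(lambda x: np.log(x))
new_data_log['pledged'] = new_data_log['pledged'].apply(lambda x: np.log(x))
new_data_log.head()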
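
To see what the lambda is shorthand for, the sketch below compares three equivalent ways of taking the log of a column on made-up values (the data and the names `toy_backers`, `log_of`, `via_lambda`, `via_named`, and `via_vectorized` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the 'backers' column
toy_backers = pd.Series([1.0, 10.0, 100.0], name="backers")

def log_of(x):
    """Named equivalent of the lambda: return the natural log of x."""
    return np.log(x)

via_lambda = toy_backers.apply(lambda x: np.log(x))  # anonymous function
via_named = toy_backers.apply(log_of)                # named function
via_vectorized = np.log(toy_backers)                 # NumPy also accepts a whole Series

print(via_lambda)
```

All three produce the same values; the vectorized call is usually the fastest on large columns.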

Now, create the same plot, but use the **new_data_log** data frame with transformed variables. 

In [None]:
# YOUR CODE GOES HERE

Experiment with the data a little more on your own. What story can you tell about **Kickstarter** data now?

Type here