# Classification Challenge

The dataset here comes from a Kaggle example on Kickstarter projects. Your goals are the following:

- Load and explore the data
- Determine a strategy for handling missing values
- Build a classifier to predict the `state` column
- Compare and visualize the ROC curves for three different classifiers:
  - `LogisticRegression`
  - `KNeighborsClassifier`
  - `DecisionTreeClassifier`
- What did the `DecisionTreeClassifier` decide were the most important features?  Visualize the top five.
- Visualize a `DecisionTreeClassifier` with depth 3, and describe the results.
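
As a rough illustration of the last two items, here is a minimal sketch on synthetic data. The data from `make_classification` and the `feature_*` names are stand-ins and are not taken from the Kickstarter file. It ranks features by `feature_importances_` and prints a depth-3 tree as text.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are hypothetical.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Unrestricted tree, used here only to rank the features.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
importances = pd.Series(full_tree.feature_importances_, index=feature_names)
top_five = importances.sort_values(ascending=False).head(5)
print(top_five)

# Shallow tree whose splits are short enough to read and describe.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X_demo, y_demo)
print(export_text(shallow_tree, feature_names=feature_names))
```

If matplotlib is available, `top_five.plot.barh()` and `sklearn.tree.plot_tree(shallow_tree)` should give graphical versions of the same two results.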

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data/ks-projects.csv.zip', encoding='Windows-1252', compression='zip')





In [4]:
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,


In [50]:
df.groupby(['state','country']).size()

state       country
0           0             96
1           1             15
10          1             10
100         1              5
            2              1
            3              1
            4              1
1010        18             1
103         4              1
1035        12             1
1056        62             1
10564       91             1
106         2              1
10770       170            1
1085        45             1
10890.45    107            1
109         6              1
109301.56   2001           1
110         12             1
            2              1
            3              1
            4              1
            7              1
1100        12             1
11044       248            1
            81             1
11050       176            1
1111.11     9              1
111307.22   229            1
11315.5     92             1
                       ...  
successful  LU            13
            MX             2
            N,"0       

In [5]:
df.isnull().sum()

ID                     0
name                   4
category               5
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd pledged         3790
Unnamed: 13       323125
Unnamed: 14       323738
Unnamed: 15       323746
Unnamed: 16       323749
dtype: int64

In [6]:
df.shape

(323750, 17)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323750 entries, 0 to 323749
Data columns (total 17 columns):
ID                323750 non-null int64
name              323746 non-null object
category          323745 non-null object
main_category     323750 non-null object
currency          323750 non-null object
deadline          323750 non-null object
goal              323750 non-null object
launched          323750 non-null object
pledged           323750 non-null object
state             323750 non-null object
backers           323750 non-null object
country           323750 non-null object
usd pledged       319960 non-null object
Unnamed: 13       625 non-null object
Unnamed: 14       12 non-null object
Unnamed: 15       4 non-null object
Unnamed: 16       1 non-null float64
dtypes: float64(1), int64(1), object(15)
memory usage: 42.0+ MB


In [14]:
df.columns

Index(['ID ', 'name ', 'category ', 'main_category ', 'currency ', 'deadline ',
       'goal ', 'launched ', 'pledged ', 'state ', 'backers ', 'country ',
       'usd pledged ', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16'],
      dtype='object')

In [17]:
df.columns = ['ID', 'name', 'category', 'main_category', 'currency', 'deadline',
              'goal', 'launched', 'pledged', 'state', 'backers', 'country',
              'usd pledged', 'Unnamed_13', 'Unnamed_14', 'Unnamed_15', 'Unnamed_16']

In [18]:
df.columns

Index(['ID', 'name', 'category', 'main_category', 'currency', 'deadline',
       'goal', 'launched', 'pledged', 'state', 'backers', 'country',
       'usd pledged', 'Unnamed_13', 'Unnamed_14', 'Unnamed_15', 'Unnamed_16'],
      dtype='object')
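
The same cleanup can probably be done without typing every name by hand. The sketch below uses a small hypothetical frame (`demo`) with a few of the padded headers. It strips the trailing spaces and rewrites `Unnamed: 13` to `Unnamed_13`.

```python
import pandas as pd

# Hypothetical frame carrying a few of the padded column names seen above.
demo = pd.DataFrame(columns=['ID ', 'name ', 'usd pledged ', 'Unnamed: 13'])

# Strip surrounding whitespace, then replace ': ' so the names match
# the hand-written list (e.g. 'Unnamed_13').
demo.columns = demo.columns.str.strip().str.replace(': ', '_', regex=False)
print(list(demo.columns))
```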

In [20]:
df.pledged[:5]

0        0
1      220
2        1
3     1283
4    52375
Name: pledged, dtype: object

In [24]:
df[['pledged','backers']] = df[['pledged','backers']].apply(pd.to_numeric)

ValueError: ('Unable to parse string "2016-01-03 00:56:46" at position 1454', 'occurred at index pledged')

In [29]:
df[['pledged','backers']] = df[['pledged','backers']].apply(pd.to_numeric)

ValueError: ('Unable to parse string "2014-08-09 03:16:02" at position 1563', 'occurred at index pledged')
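
The errors above suggest that some rows hold timestamps where a number is expected. One way to get past this is `errors='coerce'`, which turns unparseable strings into `NaN` so the offending rows can be inspected. The sketch below uses a small hypothetical frame (`demo`) rather than the real file.

```python
import pandas as pd

# Hypothetical rows: the third one mimics a misaligned record.
demo = pd.DataFrame({
    'pledged': ['0', '220', '2016-01-03 00:56:46', '1283'],
    'backers': ['0', '3', 'US', '14'],
})

# Unparseable strings become NaN instead of raising ValueError.
converted = demo[['pledged', 'backers']].apply(pd.to_numeric, errors='coerce')

# Rows where either column failed to parse are candidates for inspection.
bad_rows = demo[converted.isna().any(axis=1)]
print(converted)
print(bad_rows)
```

Whether to drop or repair those rows is a separate decision; this only locates them.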

In [39]:
df.pledged.value_counts()

0                      45784
1                       7782
10                      4201
25                      3479
5                       3207
50                      3181
20                      2664
100                     2638
2                       2147
30                      1830
15                      1516
40                      1263
60                      1243
35                      1205
11                      1084
75                      1084
150                     1021
200                      985
6                        978
70                       852
26                       849
125                      847
55                       825
3                        825
110                      804
51                       758
45                       751
80                       735
500                      698
21                       693
                       ...  
19508.24                   1
2015-10-30 17:11:49        1
1190.35                    1
47378         

In [40]:
df.main_category.value_counts()

Film & Video               57679
Music                      46744
Publishing                 34233
Games                      28008
Technology                 26128
Art                        23975
Design                     23872
Food                       21229
Fashion                    18398
Theater                     9972
Photography                 9680
Comics                      8753
Crafts                      7187
Journalism                  4073
Dance                       3375
Fiction                       35
Product Design                29
Documentary                   28
Nonfiction                    27
Children's Books              18
Tabletop Games                14
Shorts                        13
Video Games                   12
Mixed Media                   11
Apparel                       10
Art Books                     10
Web                            8
Narrative Film                 8
Country & Folk                 7
Apps                           7
          

In [41]:
df.category.value_counts()

Product Design                                        17477
Documentary                                           14891
Music                                                 13907
Shorts                                                11681
Tabletop Games                                        10708
Food                                                  10533
Video Games                                           10059
Film & Video                                           9207
Fiction                                                8231
Fashion                                                7910
Nonfiction                                             7404
Art                                                    6894
Theater                                                6833
Rock                                                   6345
Technology                                             5762
Children's Books                                       5651
Photography                             

In [42]:
df.country.value_counts()

US      257565
GB       27509
CA       11992
AU        6236
N,"0      3790
DE        2684
NL        2259
FR        1910
IT        1750
ES        1372
SE        1269
NZ        1136
DK         825
IE         575
NO         526
CH         471
BE         402
AT         377
MX         214
SG         119
0          102
HK          97
1           74
3           45
2           41
LU          40
4           28
7           18
10          14
6           12
         ...  
338          1
124          1
246          1
68           1
284          1
61           1
289          1
117          1
38           1
813          1
81           1
167          1
222          1
307          1
85           1
582          1
43           1
78           1
53           1
99           1
732          1
35           1
70           1
93           1
72           1
169          1
71           1
780          1
109          1
238          1
Name: country, Length: 162, dtype: int64

In [51]:
df.groupby(['state']).size()

state
0                 96
1                 15
10                10
100                8
1010               1
103                1
1035               1
1056               1
10564              1
106                1
10770              1
1085               1
10890.45           1
109                1
109301.56          1
110                5
1100               1
11044              2
11050              1
1111.11            1
111307.22          1
11315.5            1
1146               1
115                2
11558              1
11565              1
118                1
1181               1
1191               1
12                 2
               ...  
8542               1
856                1
86                 1
860                1
8609.6             1
865                1
876                1
890                2
90                 3
900                2
9187               1
9210.69            1
9337               1
936                1
938                1
9430.8             1
95     
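
The numeric values in `state` and `country` look like fields shifted over from neighbouring columns. A possible way to separate such rows, sketched on a hypothetical frame (`demo`), is to flag every row whose `state` parses as a number.

```python
import pandas as pd

# Hypothetical rows: two of them carry numbers where a label belongs.
demo = pd.DataFrame({
    'state': ['failed', '1010', 'successful', '10890.45', 'canceled'],
    'country': ['GB', '18', 'US', '107', 'US'],
})

# A genuine state label should not be parseable as a number.
looks_numeric = pd.to_numeric(demo['state'], errors='coerce').notna()
shifted = demo[looks_numeric]   # rows that appear misaligned
clean = demo[~looks_numeric]    # rows with a textual state
print(shifted)
print(clean)
```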

In [64]:
status = df.loc[df['state'].isin(['successful', 'failed'])]

In [65]:
status.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed_13,Unnamed_14,Unnamed_15,Unnamed_16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,
5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,USD,2014-12-21 18:30:44,1000,2014-12-01 18:30:44,1205,successful,16,US,1205,,,,


In [73]:
status.groupby('state').size()

state
failed        168221
successful    113081
dtype: int64
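
These counts give a useful reference point: a model that always predicts the majority class already reaches a certain accuracy. A quick calculation from the two counts above:

```python
# Class counts copied from the groupby output above.
counts = {'failed': 168221, 'successful': 113081}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most common class.
baseline_accuracy = max(shares.values())
print(round(baseline_accuracy, 3))  # 0.598
```

Any classifier built later should be judged against this roughly 60% baseline rather than against 50%.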

In [77]:
status.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 281302 entries, 0 to 323749
Data columns (total 17 columns):
ID               281302 non-null int64
name             281299 non-null object
category         281302 non-null object
main_category    281302 non-null object
currency         281302 non-null object
deadline         281302 non-null object
goal             281302 non-null object
launched         281302 non-null object
pledged          281302 non-null object
state            281302 non-null object
backers          281302 non-null object
country          281302 non-null object
usd pledged      281092 non-null object
Unnamed_13       0 non-null object
Unnamed_14       0 non-null object
Unnamed_15       0 non-null object
Unnamed_16       0 non-null float64
dtypes: float64(1), int64(1), object(15)
memory usage: 38.6+ MB


In [79]:
# get_dummies needs the column itself, not the string 'state'
y = pd.get_dummies(status['state'])

In [88]:
# Note: drop returns a new DataFrame; without assignment, status itself is unchanged
status.drop(['Unnamed_13', 'Unnamed_14', 'Unnamed_15', 'Unnamed_16'], axis=1)

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375
5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,USD,2014-12-21 18:30:44,1000,2014-12-01 18:30:44,1205,successful,16,US,1205
6,1000030581,Chaser Strips. Our Strips make Shots their B*tch!,Drinks,Food,USD,2016-03-17 19:05:12,25000,2016-02-01 20:05:12,453,failed,40,US,453
9,100004721,Of Jesus and Madmen,Nonfiction,Publishing,CAD,2013-10-09 18:19:37,2500,2013-09-09 18:19:37,0,failed,0,CA,0
10,100005484,Lisa Lim New CD!,Indie Rock,Music,USD,2013-04-08 06:42:58,12500,2013-03-09 06:42:58,12700,successful,100,US,12700
11,1000055792,The Cottage Market,Crafts,Crafts,USD,2014-10-02 17:11:50,5000,2014-09-02 17:11:50,0,failed,0,US,0
12,1000056157,G-Spot Place for Gamers to connect with eachot...,Games,Games,USD,2016-03-25 22:01:12,200000,2016-02-09 23:01:12,0,failed,0,US,0


In [90]:
status.shape

(281302, 17)
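
The shape still reports 17 columns because `drop` returns a new frame instead of modifying `status`. A minimal sketch of the difference on a hypothetical two-column frame (`demo`):

```python
import pandas as pd

# Hypothetical frame with one entirely empty column.
demo = pd.DataFrame({'goal': ['1000', '45000'], 'Unnamed_13': [None, None]})

# Without assignment the original frame keeps both columns.
demo.drop(['Unnamed_13'], axis=1)
print(demo.shape)

# Assigning the result keeps the change.
trimmed = demo.drop(['Unnamed_13'], axis=1)
print(trimmed.shape)

# Alternative: drop every column that contains only missing values.
non_empty = demo.dropna(axis=1, how='all')
print(list(non_empty.columns))
```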

In [97]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, classification_report
from sklearn.pipeline import make_pipeline

In [114]:
target = pd.get_dummies(status.state)
y = target.successful
X = status[['goal','backers']]

In [115]:
y[:5]

0    0
1    0
2    0
4    1
5    1
Name: successful, dtype: uint8

In [116]:
clf = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [117]:
preds = clf.predict(X_test)

In [118]:
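# Note: classification_report expects (y_true, y_pred). With the arguments in
# this order the predictions are treated as ground truth, so the precision and
# recall columns below are swapped and support counts predicted labels.
print(classification_report(preds, y_test))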

             precision    recall  f1-score   support

          0       0.88      0.96      0.92     38730
          1       0.94      0.84      0.89     31596

avg / total       0.91      0.91      0.90     70326
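
For the ROC comparison named in the goals, one possible outline is below. It runs on synthetic data from `make_classification`, not on the Kickstarter features, and the hyperparameters (`max_iter=1000`, `max_depth=3`) are arbitrary choices for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and target.
X_demo, y_demo = make_classification(n_samples=1000, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(max_depth=3, random_state=0),
}

curves = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # ROC curves need scores, so use the probability of the positive class.
    scores = model.predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)
    curves[name] = (fpr, tpr, roc_auc_score(y_te, scores))
    print(f"{name}: AUC = {curves[name][2]:.3f}")
```

Each `(fpr, tpr)` pair in `curves` can then be drawn on a shared axis, for example with matplotlib's `plt.plot(fpr, tpr, label=name)`.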

