In [7]:
import numpy as np
import pandas as pd

In [14]:
data = pd.read_csv('/Users/nick/github/hugo/project/formula1_1950_2020/results.csv')

As a small preparation, let's write a function that outputs several samples from the dataframes - a sample of 10 observations, a sample of 50 observations and a sample of 500 observations. When you implement the function you will see that you can create 3 sample dataframes in one shot. This is a wonderful property of Python. 

In [37]:
def sample_dfs(data, sample_size_S=10, sample_size_M=50, sample_size_L=500):
    
    """
    This function samples 3 times from a given data frame. 
    
    Input: DataFrame
    Output: DataFrames (3)
    
    """
    df_S = pd.DataFrame(data).sample(sample_size_S, random_state=1)
    df_M = pd.DataFrame(data).sample(sample_size_M, random_state=1)
    df_L = pd.DataFrame(data).sample(sample_size_L, random_state=1)
    
    return df_S, df_M, df_L

Run the following code and make sure the function runs with no errors. You can explore three objects in detail to make sure they are what are expected of them to be. 

In [38]:
df_S, df_M, df_L = sample_dfs(data)

assert df_S.shape[0] == 10
assert df_M.shape[0] == 50
assert df_L.shape[0] == 500

We are moving to a machine learning part of the course. This short notebook is aimed to make a first introduction to **sklearn** library and solidify "train-test" split concept. 

In [66]:
df_S.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
14011,14012,570,275,182,33,24,12,12,12,0.0,50,\N,\N,\N,\N,\N,\N,14
4959,4960,249,84,27,25,8,\N,R,14,0.0,67,\N,\N,\N,\N,\N,\N,5
3719,3720,193,76,24,19,22,\N,R,21,0.0,17,\N,\N,\N,\N,\N,\N,7
20884,20887,845,813,3,12,9,15,15,15,0.0,65,\N,\N,64,14,1:29.391,187.468,11
12728,12729,525,207,32,1,10,10,10,10,0.0,66,\N,\N,\N,\N,\N,\N,60


First thing to notice is that the library is enormous. Therefore, it is split into several sub-libraries. You can import the whole library, which is not recommended, a sub-library, or a function. 

In [29]:
import sklearn

In [30]:
from sklearn import model_selection

In [31]:
from sklearn.model_selection import train_test_split

It is generally a good practice to import only the functions you need. This approach also helps to reduce the amount of code you have to type. Compare below: 

In [32]:
#if you imported a library with 'import sklearn'
split_long = sklearn.model_selection.train_test_split(np.array(np.random.choice(1000, 220)))

In [33]:
#if you imported a function with 'from sklearn.model_selection import train_test_split'
split_short = train_test_split(np.array(np.random.choice(1000, 220)))

Now import the clean version of Kickstarter dataset and create a train-test split of the data. Your dependent variable is *state*. Your independent variables are *goal* and *backers*. Train-test split is *70/30*. Set random state to *22*. Assign the output of the function to 4 variables (the logic is the same as when you called **sample_dfs()** function above). Let the variable names be *X_train*, *X_test*, *y_train*, and *y_test*. Explore the output, make sure it makes sense to you in terms of dimensions, dependent and independent variables. Are the means of *goal*, and *backers* is train and test sets comparable?

In [44]:
data2 = pd.read_csv('/Users/nick/github/hugo/python basics/data_cleaning/kickstarter_clean.csv')

In [90]:
X_train, X_test, Y_train, Y_test = train_test_split(data2.loc[:, ['goal', 'backers']], 
                                                    data2.loc[:, ['state']], 
                                                    test_size = 0.3, 
                                                    random_state = 22)

In [91]:
X_train, X_test, Y_train, Y_test

(         goal  backers
 1335  25350.0      3.0
 301    5000.0    110.0
 108    1800.0      1.0
 975    2000.0     65.0
 1023   3000.0     68.0
 ...       ...      ...
 356     599.0      3.0
 960    1000.0      5.0
 812   10000.0      4.0
 132    3500.0     75.0
 885   25000.0      0.0
 
 [1400 rows x 2 columns],
              goal     backers
 1201  23805.09873   93.723718
 905   23805.09873   93.723718
 1978  23805.09873   93.723718
 1296  23805.09873   93.723718
 1752  23805.09873   93.723718
 ...           ...         ...
 666     200.00000    0.000000
 384   10000.00000  191.000000
 663    7500.00000   95.000000
 1852    250.00000    1.000000
 71     2000.00000   28.000000
 
 [600 rows x 2 columns],
            state
 1335      failed
 301   successful
 108       failed
 975   successful
 1023  successful
 ...          ...
 356       failed
 960       failed
 812       failed
 132   successful
 885         live
 
 [1400 rows x 1 columns],
            state
 1201      failed
 905 

In [92]:
X_train.head()

Unnamed: 0,goal,backers
1335,25350.0,3.0
301,5000.0,110.0
108,1800.0,1.0
975,2000.0,65.0
1023,3000.0,68.0


In [77]:
Y_train

(1500, 1)

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE

In [None]:
#YOUR CODE GOES HERE