In [7]:
import numpy as np
import pandas as pd

In [14]:
data = pd.read_csv('/Users/nick/github/hugo/project/formula1_1950_2020/results.csv')

As a small preparation, let's write a function that draws several samples from a DataFrame: one of 10 observations, one of 50 observations, and one of 500 observations. When you implement the function, you will see that it can return all three sample DataFrames at once. A Python function can return several values as a tuple, and the caller can unpack them into separate variables in a single assignment.
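To see the mechanism in isolation, here is a minimal sketch of a function that returns several values at once. The function `min_mid_max` and its input list are made up for illustration and are not part of the course material.

```python
def min_mid_max(values):
    """Return the smallest, middle, and largest value of a list."""
    ordered = sorted(values)
    # Separating values with commas in a return statement builds a tuple.
    return ordered[0], ordered[len(ordered) // 2], ordered[-1]


# The returned tuple is unpacked into three variables in one assignment.
lowest, middle, highest = min_mid_max([5, 1, 9, 3, 7])
print(lowest, middle, highest)
```

The sampling function in the next cell relies on the same pattern, only with DataFrames instead of numbers.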

In [37]:
def sample_dfs(data, sample_size_S=10, sample_size_M=50, sample_size_L=500):
    
    """
    This function samples 3 times from a given data frame. 
    
    Input: DataFrame
    Output: DataFrames (3)
    
    """
    df_S = pd.DataFrame(data).sample(sample_size_S, random_state=1)
    df_M = pd.DataFrame(data).sample(sample_size_M, random_state=1)
    df_L = pd.DataFrame(data).sample(sample_size_L, random_state=1)
    
    return df_S, df_M, df_L

Run the following code and make sure the function runs with no errors. You can explore the three objects in detail to make sure they are what you expect them to be.

In [38]:
df_S, df_M, df_L = sample_dfs(data)

assert df_S.shape[0] == 10
assert df_M.shape[0] == 50
assert df_L.shape[0] == 500
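The `random_state=1` argument inside `sample_dfs` is what makes these samples reproducible: sampling twice with the same seed returns the same rows. The check below is a small sketch on a made-up DataFrame (`toy` is hypothetical and only stands in for the results data).

```python
import pandas as pd

# Hypothetical stand-in for the real data: 100 rows with a single column.
toy = pd.DataFrame({"value": range(100)})

# Two samples drawn with the same seed contain exactly the same rows.
first = toy.sample(10, random_state=1)
second = toy.sample(10, random_state=1)

print(first.shape)
print(first.equals(second))
```

Without a fixed `random_state`, each call would generally return a different set of rows.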

We are now moving on to the machine learning part of the course. This short notebook gives a first introduction to the **sklearn** library and reinforces the concept of a train-test split.

In [39]:
sample_dfs(data)

(       resultId  raceId  driverId  constructorId number  grid position  \
 14011     14012     570       275            182     33    24       12   
 4959       4960     249        84             27     25     8       \N   
 3719       3720     193        76             24     19    22       \N   
 20884     20887     845       813              3     12     9       15   
 12728     12729     525       207             32      1    10       10   
 4041       4042     208        75             20     15    15       14   
 10159     10160     433       187              1      1    21        7   
 3310       3311     175        30              6      3     3        8   
 1255       1256      77         2              3      8     1        2   
 7787       7788      12         4              4      7    13       \N   
 
       positionText  positionOrder  points  laps     time milliseconds  \
 14011           12             12     0.0    50       \N           \N   
 4959             R      

The first thing to notice is that the library is enormous, so it is split into several submodules. You can import the whole library (which is not recommended), a single submodule, or a single function.

In [29]:
import sklearn

In [30]:
from sklearn import model_selection

In [31]:
from sklearn.model_selection import train_test_split

It is generally good practice to import only the functions you need. This also reduces the amount of code you have to type. Compare the two calls below:

In [32]:
# If you imported the library with 'import sklearn'
# (np.random.choice already returns an array, so no np.array wrapper is needed)
split_long = sklearn.model_selection.train_test_split(np.random.choice(1000, 220))

In [33]:
# If you imported the function with 'from sklearn.model_selection import train_test_split'
split_short = train_test_split(np.random.choice(1000, 220))
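Neither call above specifies a split size. The sketch below shows what `train_test_split` returns in that case, assuming scikit-learn's documented default of holding out 25% of the data when neither `test_size` nor `train_size` is given. The array `values` is illustrative and replaces the random draw used above so that the result is easy to inspect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative input: the integers 0..219 (220 observations).
values = np.arange(220)

# With a single input array, the function returns two pieces: train and test.
train_part, test_part = train_test_split(values, random_state=0)

print(len(train_part), len(test_part))
```

Every observation ends up in exactly one of the two pieces, which is the property that separates a train-test split from two independent samples.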

Now import the clean version of the Kickstarter dataset and create a train-test split of the data. Your dependent variable is *state*. Your independent variables are *goal* and *backers*. The train-test split is *70/30*. Set the random state to *22*. Assign the output of the function to 4 variables (the logic is the same as when you called the **sample_dfs()** function above). Name the variables *X_train*, *X_test*, *y_train*, and *y_test*. Explore the output and make sure it makes sense to you in terms of dimensions and of dependent and independent variables. Are the means of *goal* and *backers* in the train and test sets comparable?
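One possible shape of such a split is sketched below on a tiny made-up DataFrame. The name `toy` and all of its values are hypothetical and only borrow the column names *goal*, *backers*, and *state* from the exercise. The real Kickstarter file will have different contents and dimensions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Kickstarter file (10 rows).
toy = pd.DataFrame({
    "goal": [1000, 5000, 2500, 12000, 800, 30000, 4500, 7000, 1500, 20000],
    "backers": [12, 80, 30, 150, 5, 400, 60, 95, 20, 210],
    "state": ["failed", "successful"] * 5,
})

X = toy[["goal", "backers"]]  # independent variables
y = toy["state"]              # dependent variable

# test_size=0.3 requests a 70/30 split; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=22
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# Compare the column means of the two feature sets.
print(X_train.mean())
print(X_test.mean())
```

Passing `X` and `y` together keeps each row of features aligned with its label, and no row appears in both the train and the test set.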

In [44]:
data2 = pd.read_csv('/Users/nick/github/hugo/python basics/data_cleaning/kickstarter_clean.csv')

In [60]:
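def sample_dfs2(data, train_size=70, test_size=30):
    # Caution: this draws 70 and 30 rows, which is not a 70/30 train-test split.
    # All four samples share the same random_state, so the test rows repeat
    # training rows (compare the indices in the output below). It also uses
    # 'backers' as the target, whereas the exercise asks for 'state'.
    # Use the 'data' argument instead of the global 'data2'.
    X_train = pd.DataFrame(data['goal']).sample(train_size, random_state=22)
    X_test = pd.DataFrame(data['goal']).sample(test_size, random_state=22)
    Y_train = pd.DataFrame(data['backers']).sample(train_size, random_state=22)
    Y_test = pd.DataFrame(data['backers']).sample(test_size, random_state=22)

    return X_train, X_test, Y_train, Y_test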

In [64]:
X_train, X_test, Y_train, Y_test = sample_dfs2(data2)
X_train, X_test, Y_train, Y_test

(               goal
 1201    23805.09873
 905     23805.09873
 1978    23805.09873
 1296    23805.09873
 1752    23805.09873
 ...             ...
 1159    12500.00000
 705      2500.00000
 76       5000.00000
 1604    75000.00000
 1274  1000000.00000
 
 [70 rows x 1 columns],
               goal
 1201   23805.09873
 905    23805.09873
 1978   23805.09873
 1296   23805.09873
 1752   23805.09873
 469    23805.09873
 168    23805.09873
 954    23805.09873
 734    23805.09873
 143    23805.09873
 918    19500.00000
 898     1400.00000
 649     2000.00000
 1071   10000.00000
 823    70000.00000
 732     8918.00000
 919     2200.00000
 1087   50000.00000
 308     7450.00000
 1753    4000.00000
 411    26500.00000
 1868    5000.00000
 1807   74000.00000
 1153     950.00000
 1948  391000.00000
 1907    6600.00000
 157    80000.00000
 860     1460.00000
 570    30000.00000
 922     6000.00000,
          backers
 1201   93.723718
 905    93.723718
 1978   93.723718
 1296   93.723718
 1752   93.

In [None]:
#YOUR CODE GOES HERE
