# Challenge Overview

In 1998, the Adventure Works Cycles company collected a large volume of data about their existing customers, including demographic features and information about purchases they have made. The company is particularly interested in analyzing customer data to determine any apparent relationships between demographic features known about the customers and the likelihood of a customer purchasing a bike. Additionally, the analysis should endeavor to determine whether a customer's average monthly spend with the company can be predicted from known customer characteristics.

In this project, you must tackle three challenges:

Challenge 1: Explore the data and gain some insights into Adventure Works customer characteristics and purchasing behavior.

Challenge 2: Build a classification model to predict customer purchasing behavior.

Challenge 3: Build a regression model to predict a customer's average monthly spend.

# Challenge 1: Data Exploration

To complete this challenge:

1.   Download the Adventure Works data files - see previous unit.

2.   Clean the data by replacing any missing values and removing duplicate rows. In this dataset, each customer is identified by a unique customer ID. The most recent version of a duplicated record should be retained.

3.   Explore the data by calculating summary and descriptive statistics for the features in the dataset, calculating correlations between features, and creating data visualizations to determine apparent relationships in the data.

4.   Based on your analysis of the customer data **after** removing all duplicate customer records, evaluate the BikeBuyer and AveMonthSpend metrics.

## Meeting Challenge 1

First things first: load the required packages and the datasets.

In [0]:
#Loading required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

#Loading required datasets
CustomerDemographics = pd.read_csv('https://raw.githubusercontent.com/CeeThinwa/DataScienceLearning/master/AdvWorksCusts.csv')
AverageMonthlySpend = pd.read_csv('https://raw.githubusercontent.com/CeeThinwa/DataScienceLearning/master/AW_AveMonthSpend.csv')
BikeBuyer = pd.read_csv('https://raw.githubusercontent.com/CeeThinwa/DataScienceLearning/master/AW_BikeBuyer.csv')

#and additional test dataset for Challenge 2
TestData = pd.read_csv('https://raw.githubusercontent.com/CeeThinwa/DataScienceLearning/master/AW_test.csv')

#Viewing the data
print(CustomerDemographics.shape)
print(AverageMonthlySpend.shape)
print(BikeBuyer.shape)

#Viewing the datatypes within each dataset
print (' ')
print('Customer Demographics DataTypes')
print(CustomerDemographics.dtypes)
print(' ')
print('Average Monthly Spend DataTypes')
print(AverageMonthlySpend.dtypes)
print(' ')
print('Bike Buyer DataTypes')
print(BikeBuyer.dtypes)

(16519, 23)
(16519, 2)
(16519, 2)
 
Customer Demographics DataTypes
CustomerID               int64
Title                   object
FirstName               object
MiddleName              object
LastName                object
Suffix                  object
AddressLine1            object
AddressLine2            object
City                    object
StateProvinceName       object
CountryRegionName       object
PostalCode              object
PhoneNumber             object
BirthDate               object
Education               object
Occupation              object
Gender                  object
MaritalStatus           object
HomeOwnerFlag            int64
NumberCarsOwned          int64
NumberChildrenAtHome     int64
TotalChildren            int64
YearlyIncome             int64
dtype: object
 
Average Monthly Spend DataTypes
CustomerID       int64
AveMonthSpend    int64
dtype: object
 
Bike Buyer DataTypes
CustomerID    int64
BikeBuyer     int64
dtype: object


Next, we clean the AverageMonthlySpend dataset by removing duplicate rows and removing missing values...

In [0]:
#This involves first identifying the initial shape of the dataframe
print(AverageMonthlySpend.shape)

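#then applying the dropna method to the data
#(note: dropna returns a new dataframe; the result is not assigned here, so AverageMonthlySpend itself is unchanged)
AverageMonthlySpend.dropna()

#and checking for any duplicates
#(likewise, drop_duplicates returns a new dataframe and leaves the original as it was)
AverageMonthlySpend.drop_duplicates(['CustomerID'], keep = 'last')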

#and finally checking the final shape of the dataframe
print(AverageMonthlySpend.shape)

#Knowing this, we can then identify the descriptive stats for each column!
print(' ')
print('**Median Row**')
print(AverageMonthlySpend.median())
print(' ')
AverageMonthlySpend.describe()

(16519, 2)
(16519, 2)
 
**Median Row**
CustomerID       20221.0
AveMonthSpend       68.0
dtype: float64
 


| | CustomerID | AveMonthSpend |
|---|---|---|
| count | 16519.0 | 16519.0 |
| mean | 20234.225195 | 72.405957 |
| std | 5342.515987 | 27.28537 |
| min | 11000.0 | 22.0 |
| 25% | 15604.5 | 52.0 |
| 50% | 20221.0 | 68.0 |
| 75% | 24860.5 | 84.0 |
| max | 29482.0 | 176.0 |
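
Because `dropna` and `drop_duplicates` return new DataFrames by default, their results need to be assigned for the cleaning to take effect. The following is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from the Adventure Works files.

```python
import pandas as pd

# Hypothetical toy data: customer 11001 appears twice, the later row being the newer record
toy = pd.DataFrame({
    "CustomerID": [11000, 11001, 11001, 11002],
    "AveMonthSpend": [89.0, 117.0, 120.0, None],
})

# Assign the results back; otherwise the original frame is left unchanged
cleaned = toy.dropna()
cleaned = cleaned.drop_duplicates(subset="CustomerID", keep="last")

print(toy.shape)      # (4, 2): the original is untouched
print(cleaned.shape)  # (2, 2): the NaN row and the older duplicate are gone
```

With `keep="last"`, the row that appears later in the file is the one retained, which matches the requirement to keep the most recent version of a duplicated record only if the file is ordered from oldest to newest.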


Next, we repeat the cleaning process for the BikeBuyer dataset, checking the shape of the dataset to see whether we have lost any rows...

In [0]:
#Again, first we identify the initial shape of the dataframe
print(BikeBuyer.shape)

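#apply the dropna method to the data
#(the result is not assigned, so BikeBuyer itself is unchanged)
BikeBuyer.dropna()

#and check for any duplicates
#(again, the deduplicated dataframe is returned but not stored)
BikeBuyer.drop_duplicates(['CustomerID'], keep = 'last')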

#and finally checking the final shape of the dataframe
print(BikeBuyer.shape)

(16519, 2)
(16519, 2)


We can finally clean and prepare the CustomerDemographics dataset for further analysis...

In [0]:
#For one last time, we first identify the initial shape of the dataframe
print(CustomerDemographics.shape)

#and check for any duplicates
#(the deduplicated dataframe is returned but not stored, so the shape below is unchanged)
CustomerDemographics.drop_duplicates(['CustomerID'], keep = 'last')

#finally, we can check the final shape of the dataframe
print(CustomerDemographics.shape)
print(' ')

#To identify the median YearlyIncome for each occupation category, we first group by the Occupation series and then find the corresponding medians of Yearly Income
print(CustomerDemographics.groupby(['Occupation'])['YearlyIncome'].median())

(16519, 23)
(16519, 23)
 
Occupation
Clerical           49387.0
Management        118780.0
Manual             21722.5
Professional       99046.0
Skilled Manual     66481.0
Name: YearlyIncome, dtype: float64


Now that all 3 datasets are cleaned, it is time to merge them into a new dataframe that we can further analyse.

In [0]:
#First things first, we join the Average Monthly Spend to the Customer Demographics Data
#(join aligns on the row index, not on CustomerID, so this relies on all three files listing customers in the same order)
IntegratedCustomerData = CustomerDemographics.join(AverageMonthlySpend['AveMonthSpend'])

#Next, we repeat the join so as to add the Bike Buyer series to the new integrated dataset
IntegratedCustomerData = IntegratedCustomerData.join(BikeBuyer['BikeBuyer'])

IntegratedCustomerData.head(10)

Unnamed: 0,CustomerID,Title,FirstName,MiddleName,LastName,Suffix,AddressLine1,AddressLine2,City,StateProvinceName,...,Occupation,Gender,MaritalStatus,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome,AveMonthSpend,BikeBuyer
0,11000,,Jon,V,Yang,,3761 N. 14th St,,Rockhampton,Queensland,...,Professional,M,M,1,0,0,2,137947,89,0
1,11001,,Eugene,L,Huang,,2243 W St.,,Seaford,Victoria,...,Professional,M,S,0,1,3,3,101141,117,1
2,11002,,Ruben,,Torres,,5844 Linden Land,,Hobart,Tasmania,...,Professional,M,M,1,1,3,3,91945,123,0
3,11003,,Christy,,Zhu,,1825 Village Pl.,,North Ryde,New South Wales,...,Professional,F,S,0,1,0,0,86688,50,0
4,11004,,Elizabeth,,Johnson,,7553 Harness Circle,,Wollongong,New South Wales,...,Professional,F,S,1,4,5,5,92771,95,1
5,11005,,Julio,,Ruiz,,7305 Humphrey Drive,,East Brisbane,Queensland,...,Professional,M,S,1,1,0,0,103199,78,1
6,11006,,Janet,G,Alvarez,,2612 Berry Dr,,Matraville,New South Wales,...,Professional,F,S,1,1,0,0,84756,54,1
7,11007,,Marco,,Mehta,,942 Brook Street,,Warrnambool,Victoria,...,Professional,M,M,1,2,3,3,109759,130,1
8,11008,,Rob,,Verhoff,,624 Peabody Road,,Bendigo,Victoria,...,Professional,F,S,1,3,4,4,88005,85,1
9,11009,,Shannon,C,Carlson,,3839 Northgate Road,,Hervey Bay,Queensland,...,Professional,M,S,0,1,0,0,106399,74,0
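
The join above matches rows by position. As a hedged alternative, the sketch below merges on the CustomerID key instead, so the result does not depend on row order. The two small frames are invented for illustration, and `validate="one_to_one"` is an optional check that makes pandas raise an error if either side still contains duplicate IDs.

```python
import pandas as pd

# Hypothetical toy frames; note that the second one lists customers in a different order
demographics = pd.DataFrame({
    "CustomerID": [11000, 11001, 11002],
    "Gender": ["M", "M", "F"],
})
spend = pd.DataFrame({
    "CustomerID": [11002, 11000, 11001],
    "AveMonthSpend": [123, 89, 117],
})

# Match rows on the key column; fail loudly if a CustomerID is duplicated on either side
merged = demographics.merge(spend, on="CustomerID", how="inner", validate="one_to_one")
print(merged)
```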


In [0]:
#Now we can further analyse the data! 
#We want to identify the group of customers that account for the highest Average Monthly Spend:
#What are the max values for each gender?
print(IntegratedCustomerData.groupby(['Gender'])['AveMonthSpend'].max())
print(' ')

#To get the age of each customer, we first have to convert the BirthDate series into datetime format.
IntegratedCustomerData['BirthDate'] = pd.to_datetime(IntegratedCustomerData['BirthDate'])
#Next, we approximate each customer's age in 1998 (the year the data was collected) from the birth year alone
#and store it in a new series called Age:
Age = np.array((IntegratedCustomerData['BirthDate'].dt.year - 1998))
Age = np.abs(Age)
IntegratedCustomerData['Age'] = Age

#We repeat the same two steps for the test data: convert BirthDate into datetime format...
TestData['BirthDate'] = pd.to_datetime(TestData['BirthDate'])
#...and calculate the Age series in the same way:
Age2 = np.array((TestData['BirthDate'].dt.year - 1998))
Age2 = np.abs(Age2)
TestData['Age'] = Age2

Gender
F    114
M    176
Name: AveMonthSpend, dtype: int64
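
The Age calculation above uses only the birth year. If a more precise age is wanted, one option is to subtract a year whenever the birthday has not yet occurred by a reference date. This is a sketch only: the reference date of 1 January 1998 is an assumption (the source gives only the year), and the birth dates are made up.

```python
import pandas as pd

# Hypothetical birth dates for illustration
birth = pd.to_datetime(pd.Series(["1966-04-08", "1965-12-31", "1968-01-01"]))
ref = pd.Timestamp("1998-01-01")  # assumed collection date

# True where the birthday falls on or before the reference day within the year
had_birthday = (birth.dt.month < ref.month) | (
    (birth.dt.month == ref.month) & (birth.dt.day <= ref.day)
)

# Year difference, minus one for customers whose birthday is still to come
age = ref.year - birth.dt.year - (~had_birthday).astype(int)
print(age.tolist())  # [31, 32, 30]
```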
 


In this section, we now turn our attention to the AveMonthSpend metric:


In [0]:
print(IntegratedCustomerData.groupby(['MaritalStatus'])['AveMonthSpend'].median())
print(' ')
print(IntegratedCustomerData.groupby(['Gender'])['AveMonthSpend'].median())
print(' ')
print(IntegratedCustomerData.groupby(['NumberCarsOwned'])['AveMonthSpend'].median())
print(' ')
print(IntegratedCustomerData.groupby(['Gender'])['AveMonthSpend'].var())
print(' ')
print(IntegratedCustomerData.groupby(['NumberChildrenAtHome'])['AveMonthSpend'].median())

MaritalStatus
M    74
S    62
Name: AveMonthSpend, dtype: int64
 
Gender
F    52
M    79
Name: AveMonthSpend, dtype: int64
 
NumberCarsOwned
0     65
1     63
2     64
3     92
4    100
Name: AveMonthSpend, dtype: int64
 
Gender
F    269.723488
M    727.487022
Name: AveMonthSpend, dtype: float64
 
NumberChildrenAtHome
0     57.0
1     68.0
2     79.0
3     89.5
4    101.0
5    110.0
Name: AveMonthSpend, dtype: float64
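
The challenge also asks for correlations between features. A minimal sketch of ranking numeric features by their correlation with AveMonthSpend is shown below. It runs on synthetic data in which the relationship with NumberChildrenAtHome is built in by construction, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer data (not the Adventure Works values)
rng = np.random.default_rng(0)
n = 200
children = rng.integers(0, 6, size=n)
toy = pd.DataFrame({
    "NumberChildrenAtHome": children,
    "YearlyIncome": rng.normal(80000, 20000, size=n),
    "AveMonthSpend": 55 + 10 * children + rng.normal(0, 5, size=n),
})

# Pearson correlation of every feature with the target, strongest first
corr_with_spend = (
    toy.corr()["AveMonthSpend"]
    .drop("AveMonthSpend")
    .sort_values(ascending=False)
)
print(corr_with_spend)
```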


And we also evaluate the BikeBuyer metric more closely.

In [0]:
print(IntegratedCustomerData.groupby(['BikeBuyer'])['YearlyIncome'].median())
print(' ')
print(IntegratedCustomerData.groupby(['BikeBuyer'])['NumberCarsOwned'].median())
print(' ')
print(IntegratedCustomerData.groupby(['Occupation'])['BikeBuyer'].count())
print(' ')
print(IntegratedCustomerData.groupby(['Gender'])['BikeBuyer'].count())
print(' ')
print(IntegratedCustomerData.groupby(['MaritalStatus'])['BikeBuyer'].count())

BikeBuyer
0    65955.5
1    96122.0
Name: YearlyIncome, dtype: float64
 
BikeBuyer
0    1
1    2
Name: NumberCarsOwned, dtype: int64
 
Occupation
Clerical          2619
Management        2734
Manual            2138
Professional      4963
Skilled Manual    4065
Name: BikeBuyer, dtype: int64
 
Gender
F    8168
M    8351
Name: BikeBuyer, dtype: int64
 
MaritalStatus
M    8917
S    7602
Name: BikeBuyer, dtype: int64
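
Note that `count()` on the BikeBuyer column returns the number of customers in each group, not the number of buyers. Since BikeBuyer is coded 0/1, its sum gives the number of buyers and its mean gives the purchase rate. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M"],
    "BikeBuyer": [0, 1, 1, 1, 0],
})

# Group size, number of buyers, and share of buyers per gender
summary = toy.groupby("Gender")["BikeBuyer"].agg(
    customers="count", buyers="sum", buy_rate="mean"
)
print(summary)
```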


# Challenge 2: Classification

To complete this challenge:

1.   Use the Adventure Works Cycles customer data you worked with in challenge 1 to create a classification model that predicts whether or not a customer will purchase a bike. The model should predict bike purchasing for new customers for whom no information about average monthly spend or previous bike purchases is available.

2.   Download the test data. This data includes customer features but does not include bike purchasing or average monthly spend values.

3.   Use your model to make predictions for the test dataset. Don't forget to apply what you've learned throughout this course.





## Meeting Challenge 2

To meet this challenge, we first import any additional packages required, then load the additional data

In [0]:
#Importing additional packages...
import seaborn as sns
import numpy.random as nr
import math
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm

#Now let's examine the data
IntegratedCustomerData.head(5)

Unnamed: 0,CustomerID,Title,FirstName,MiddleName,LastName,Suffix,AddressLine1,AddressLine2,City,StateProvinceName,...,Gender,MaritalStatus,HomeOwnerFlag,NumberCarsOwned,NumberChildrenAtHome,TotalChildren,YearlyIncome,AveMonthSpend,BikeBuyer,Age
0,11000,,Jon,V,Yang,,3761 N. 14th St,,Rockhampton,Queensland,...,M,M,1,0,0,2,137947,89,0,32
1,11001,,Eugene,L,Huang,,2243 W St.,,Seaford,Victoria,...,M,S,0,1,3,3,101141,117,1,33
2,11002,,Ruben,,Torres,,5844 Linden Land,,Hobart,Tasmania,...,M,M,1,1,3,3,91945,123,0,33
3,11003,,Christy,,Zhu,,1825 Village Pl.,,North Ryde,New South Wales,...,F,S,0,1,0,0,86688,50,0,30
4,11004,,Elizabeth,,Johnson,,7553 Harness Circle,,Wollongong,New South Wales,...,F,S,1,4,5,5,92771,95,1,30


The next thing we will do is check for class imbalance in the IntegratedCustomerData...

In [0]:
#Upon checking for imbalance the following series stood out as moderately imbalanced
bikebuyer_counts1 = IntegratedCustomerData.groupby(['NumberChildrenAtHome'])['BikeBuyer'].count()
print(bikebuyer_counts1)
print(' ')

bikebuyer_counts2 = IntegratedCustomerData.groupby(['HomeOwnerFlag'])['BikeBuyer'].count()
print(bikebuyer_counts2)

NumberChildrenAtHome
0    9990
1    2197
2    1462
3    1066
4     952
5     852
Name: BikeBuyer, dtype: int64
 
HomeOwnerFlag
0     5387
1    11132
Name: BikeBuyer, dtype: int64
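
The counts above describe how customers are distributed across feature values. For a classifier, the imbalance that usually matters most is in the label itself. A minimal sketch of that check, using a made-up label series rather than the real BikeBuyer column:

```python
import pandas as pd

# Hypothetical label values for illustration
label = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="BikeBuyer")

# Share of each class; a heavily lopsided split would suggest class weights or resampling
class_share = label.value_counts(normalize=True)
print(class_share)
```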


And then we can drop the columns that contain NaN values.

In [0]:
nan_columns = ['Title', 'MiddleName', 'Suffix', 'AddressLine2']

TrainData = IntegratedCustomerData.drop(nan_columns, axis = 1)

TestData = TestData.drop(nan_columns, axis = 1)

print(TrainData.shape)
print(TestData.shape)

(16519, 22)
(500, 20)


From this point, because we want our number of features to be nice and lean, we drop the descriptive data that came with the original dataframe.

In [0]:
descriptive_columns = ['FirstName', 'LastName', 'AddressLine1', 'City', 'StateProvinceName',
                       'CountryRegionName', 'PostalCode', 'PhoneNumber', 'BirthDate']

TrainData = TrainData.drop(descriptive_columns, axis = 1)

TestData = TestData.drop(descriptive_columns, axis = 1)


print(TrainData.shape)
print(TestData.shape)
print(' ')
TrainData.head()

(16519, 13)
(500, 11)
 


| | CustomerID | Education | Occupation | Gender | MaritalStatus | HomeOwnerFlag | NumberCarsOwned | NumberChildrenAtHome | TotalChildren | YearlyIncome | AveMonthSpend | BikeBuyer | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11000 | Bachelors | Professional | M | M | 1 | 0 | 0 | 2 | 137947 | 89 | 0 | 32 |
| 1 | 11001 | Bachelors | Professional | M | S | 0 | 1 | 3 | 3 | 101141 | 117 | 1 | 33 |
| 2 | 11002 | Bachelors | Professional | M | M | 1 | 1 | 3 | 3 | 91945 | 123 | 0 | 33 |
| 3 | 11003 | Bachelors | Professional | F | S | 0 | 1 | 0 | 0 | 86688 | 50 | 0 | 30 |
| 4 | 11004 | Bachelors | Professional | F | S | 1 | 4 | 5 | 5 | 92771 | 95 | 1 | 30 |


Now feature engineering can begin! In fact, it already started when we calculated the Age series from the birth date information, for both the training and the test data. We continue by creating dummy variables for all the series that contain categorical data, as shown:

In [0]:
#We can then encode our categorical variables into dummy variables
#First we start with Marital Status, by creating the dummy variables that correspond to its categories
TrainData['MaritalStatus_Married'] = TrainData.MaritalStatus.map({'M': 1, 'S': 0})
TrainData['MaritalStatus_Single'] = TrainData.MaritalStatus.map({'M': 0, 'S': 1})

TestData['MaritalStatus_Married'] = TestData.MaritalStatus.map({'M': 1, 'S': 0})
TestData['MaritalStatus_Single'] = TestData.MaritalStatus.map({'M': 0, 'S': 1})


#We repeat the process for Gender:
TrainData['Gender_Male'] = TrainData.Gender.map({'M': 1, 'F': 0})
TrainData['Gender_Female'] = TrainData.Gender.map({'M': 0, 'F': 1})

TestData['Gender_Male'] = TestData.Gender.map({'M': 1, 'F': 0})
TestData['Gender_Female'] = TestData.Gender.map({'M': 0, 'F': 1})


#Then for Education:
education_dummies1 = pd.get_dummies(TrainData.Education)
TrainData = pd.concat([TrainData, education_dummies1], axis = 1)

education_dummies2 = pd.get_dummies(TestData.Education)
TestData = pd.concat([TestData, education_dummies2], axis = 1)

#And finally for Occupation:
occupation_dummies1 = pd.get_dummies(TrainData.Occupation)
TrainData = pd.concat([TrainData, occupation_dummies1], axis = 1)

occupation_dummies2 = pd.get_dummies(TestData.Occupation)
TestData = pd.concat([TestData, occupation_dummies2], axis = 1)

print(TrainData.shape)
print(TestData.shape)

(16519, 27)
(500, 25)
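
When dummy variables are created separately for training and test data, a category that is missing from one of them produces mismatched columns. One way to guard against this, sketched here on toy frames with invented rows, is to let `get_dummies` add the column prefix and then reindex the test columns to match the training columns.

```python
import pandas as pd

# Hypothetical toy frames; the test frame lacks the 'Skilled Manual' category
train = pd.DataFrame({"Occupation": ["Manual", "Skilled Manual", "Clerical"]})
test = pd.DataFrame({"Occupation": ["Manual", "Clerical"]})

# prefix= names each dummy after its source column, e.g. Occupation_Manual
train_d = pd.get_dummies(train, columns=["Occupation"], prefix="Occupation")
test_d = pd.get_dummies(test, columns=["Occupation"], prefix="Occupation")

# Give the test frame exactly the training columns, filling absent dummies with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
```

Using the prefix at creation time also removes the need for a separate renaming step.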


We need to rename the dummy variable series' names to identify which column they came from.

In [0]:
#Longer names come first so that 'High School' and 'Manual' do not also match inside
#'Partial High School' and 'Skilled Manual'
renames = [
    #for Education
    ('Bachelors', 'Education_Bachelors'),
    ('Graduate Degree', 'Education_Graduate'),
    ('Partial High School', 'Education_PartialHS'),
    ('High School', 'Education_HS'),
    ('Partial College', 'Education_PartialCollege'),
    #and Occupation.
    ('Clerical', 'Occupation_Clerical'),
    ('Management', 'Occupation_Management'),
    ('Skilled Manual', 'Occupation_SM'),
    ('Manual', 'Occupation_Manual'),
    ('Professional', 'Occupation_Professional'),
]

#Apply each rename to both datasets (col avoids shadowing the built-in str)
for old, new in renames:
    TrainData.columns = [col.replace(old, new) for col in TrainData.columns]
    TestData.columns = [col.replace(old, new) for col in TestData.columns]

print(TrainData.shape)
print(TestData.shape)

(16519, 27)
(500, 25)


Now that we have our dummy variables, we can now remove the original columns! It would probably be a good idea to also remove the CustomerID series at this point...

In [0]:
original_columns = ['Education', 'Occupation', 'Gender', 'MaritalStatus', 'CustomerID']

TrainData = TrainData.drop(original_columns, axis = 1)

TestData = TestData.drop(original_columns, axis = 1)

print(TrainData.shape)
print(TestData.shape)

#save to Excel
from google.colab import files
TestData.to_excel('AW_PreppedTestData.xlsx')

#files.download('AW_PreppedTestData.xlsx')#

(16519, 22)
(500, 20)


With all this preprocessing done, it is time to separate the label (BikeBuyer) and AveMonthSpend from the features.

In [0]:
Label = TrainData['BikeBuyer']
Features = TrainData.drop(['BikeBuyer'], axis = 1)
Features = Features.drop(['AveMonthSpend'], axis = 1)

print(Label.shape)
print(Features.shape)
print(TestData.shape)

#save to Excel
from google.colab import files
Features.to_excel('AV_Features.xlsx')

files.download('AV_Features.xlsx')

(16519,)
(16519, 20)
(500, 20)


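We can now fit a StandardScaler on the training Features and apply the same fitted scaler to the test data. Note that the slice iloc[:,19:] used below selects only the last of the 20 feature columns, so only that column is standardized: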

In [0]:
scaler = preprocessing.StandardScaler().fit(Features.iloc[:,19:])
Features.iloc[:,19:] = scaler.transform(Features.iloc[:,19:])
TestData.iloc[:,19:] = scaler.transform(TestData.iloc[:,19:])
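
As a hedged alternative to slicing by position, the sketch below selects the numeric columns by name, fits the scaler on the training data only, and reuses it for the test data. The frames and the choice of which columns count as numeric are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames with two numeric features and one dummy variable
train = pd.DataFrame({
    "YearlyIncome": [50000.0, 80000.0, 110000.0],
    "Age": [30.0, 40.0, 50.0],
    "Gender_Male": [1, 0, 1],
})
test = pd.DataFrame({"YearlyIncome": [80000.0], "Age": [45.0], "Gender_Male": [0]})

numeric_cols = ["YearlyIncome", "Age"]  # dummy variables are left as 0/1

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train[numeric_cols])
train[numeric_cols] = scaler.transform(train[numeric_cols])
test[numeric_cols] = scaler.transform(test[numeric_cols])

print(train)
print(test)
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.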




With our values scaled for both training and testing datasets, let's build a logistic regression model!

In [0]:
#First we compute our model
logistic_mod = linear_model.LogisticRegression() 
logistic_mod.fit(Features,Label)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
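
The model above is fitted on all of the labelled data, so nothing is left over to estimate how well it generalizes. A common approach, sketched here on synthetic data rather than the customer features, is to hold out part of the labelled rows and score the model on them. The split size, the random seeds, and the choice of accuracy as the metric are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic features and labels for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Keep 25% of the rows aside, preserving the class proportions
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)
val_pred = model.predict(X_val)

acc = accuracy_score(y_val, val_pred)
cm = confusion_matrix(y_val, val_pred)  # rows: actual class, columns: predicted class
print(acc)
print(cm)
```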

In [0]:
#then we identify our coefficients and intercept
print(logistic_mod.intercept_)
print(logistic_mod.coef_)

[-0.01858827]
[[-1.55372032e-02  3.88399703e-02  3.82885877e-01  2.53339315e-01
   1.05828547e-05 -6.94886841e-02 -1.14073286e-01  9.54850140e-02
   3.30357451e-02 -5.16240171e-02  1.37489413e-02 -3.45595063e-02
   1.50677360e-04  8.41466343e-03 -6.34304789e-03 -3.93011025e-03
  -1.35062910e-02  5.48953500e-03  1.86762931e-02 -4.81593829e-02]]
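
The coefficient array is easier to read when each value is paired with its feature name. A minimal sketch on synthetic data follows; the feature names `x_signal` and `x_noise` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: only x_signal actually drives the label
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "x_signal": rng.normal(size=300),
    "x_noise": rng.normal(size=300),
})
y = (X["x_signal"] + 0.3 * rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Label each coefficient with its column name, then order by absolute size
coefs = pd.Series(model.coef_[0], index=X.columns)
coef_table = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coef_table)
```

Coefficient magnitudes are only comparable across features when the features are on similar scales.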


With our model up and running, let's compute the class probabilities for the test data, from which we will pick the most probable class as our prediction.

In [0]:
probabilities = logistic_mod.predict_proba(TestData)
print(probabilities[:9,:])

[[0.79665473 0.20334527]
 [0.40900206 0.59099794]
 [0.92245319 0.07754681]
 [0.65801973 0.34198027]
 [0.71442923 0.28557077]
 [0.905003   0.094997  ]
 [0.41572415 0.58427585]
 [0.21488823 0.78511177]
 [0.34881316 0.65118684]]


Finally, we get the actual scores/predictions made by our model below:

In [0]:
def score_model(probs, threshold):
    return np.array([1 if x > threshold else 0 for x in probs[:,1]])
scores = score_model(probabilities, 0.5)
Answer_array = np.array(scores[:500])
Answer = pd.DataFrame(Answer_array)

#save to Excel
from google.colab import files
Answer.to_excel('Answer.xlsx')

#files.download('Answer.xlsx')#

Answer.head(10)

| | 0 |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 0 |
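
The thresholding step can also be written as a vectorized comparison instead of a list comprehension. The sketch below keeps the same rule (predict 1 when the positive-class probability exceeds the threshold) and uses made-up probabilities.

```python
import numpy as np

def score_model(probs, threshold=0.5):
    """Return 1 where the positive-class probability exceeds the threshold, else 0."""
    return (probs[:, 1] > threshold).astype(int)

# Hypothetical probabilities: column 0 is class 0, column 1 is class 1
probs = np.array([[0.80, 0.20], [0.41, 0.59], [0.35, 0.65]])

print(score_model(probs))       # [0 1 1]
print(score_model(probs, 0.6))  # [0 0 1]
```

Raising the threshold makes the model more conservative about predicting a purchase.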
