# AI Project
***
## Administration and Rules:

Greetings, this is the first project in our project series. Generally, projects are larger problem sets, and contain more open-ended questions. You will be given some general guidelines and are allowed to explore more freely.

* **Guideline** In your submitted version
    * This project is divided into 5 parts.
        * For each part, you will be given a separate deadline.
        * When the deadline for each part passes, a sample analysis of that part will be released as your starting point for the next part. (Think of these as checkpoints in a game.)
        * So, start early!

    
* **How to submit:**
    * Submit on Google Classroom

***
## Introduction
In this project, you will be using a dataset called "Ghouls Goblins and Ghosts"

### Context
After a month of making scientific observations and taking careful measurements, we’ve determined that 900 ghouls, ghosts, and goblins are infesting our halls and frightening our fellow teachers and students here at Pinghe School. When trying garlic, asking politely, and using reverse psychology didn't work, it became clear that machine learning is the only answer to banishing our unwanted guests.

![halloween-660x.png](https://i.loli.net/2020/05/24/WeOE4TFNJp3DQnm.png)

So now the hour has come to put the data we’ve collected in your hands. We’ve managed to identify 371 of the ghastly creatures, but need your help to vanquish the rest. And only an accurate classification algorithm can thwart them. Use bone length measurements, severity of rot, extent of soullessness, and other characteristics to distinguish (and extinguish) the intruders. Are you ghost-busters up for the challenge?

### Dataset
Dataset file:
* Train: https://drive.google.com/file/d/1A7kgIjEruZv3qWbNl5kWzP5xRQbZmou5/view?usp=sharing
* Test: https://drive.google.com/file/d/1UhYjXWwH5L4BzUPWB9Iq5wO05_2SyMdm/view?usp=sharing

File descriptions
* `train.csv` - the training set, which contains both features and labels (target variables)
* `test.csv` - the test set, which contains only features; your job is to predict the types

Data fields
* id - id of the creature
* bone_length - average length of bone in the creature, normalized between 0 and 1
* rotting_flesh - percentage of rotting flesh in the creature
* hair_length - average hair length, normalized between 0 and 1
* has_soul - percentage of soul in the creature
* color - dominant color of the creature: 'white','black','clear','blue','green','blood'
* type - target variable: 'Ghost', 'Goblin', and 'Ghoul'

### Goal
Predict the types of the spooky creatures!

***
## Project Milestone 0 - Starter

Make sure you can run this starter

In [None]:
# Import libraries
import numpy as np
import pandas as pd

In Google Colab, loading data works a little differently: we first need to mount the drive.

In [None]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Save a shortcut copy of the dataset onto your own drive
* Data: https://drive.google.com/file/d/1A7kgIjEruZv3qWbNl5kWzP5xRQbZmou5/view?usp=sharing

In [None]:
# Load dataset

df = pd.read_csv('/content/drive/MyDrive/train.csv')
df.head()

Unnamed: 0,id,bone_length,rotting_flesh,hair_length,has_soul,color,type
0,0,0.354512,0.350839,0.465761,0.781142,clear,Ghoul
1,1,0.57556,0.425868,0.531401,0.439899,green,Goblin
2,2,0.467875,0.35433,0.811616,0.791225,black,Ghoul
3,4,0.776652,0.508723,0.636766,0.884464,black,Ghoul
4,5,0.566117,0.875862,0.418594,0.636438,green,Ghost


## Congratulations on successfully running the Project Starter!

***
## Project Milestone #1 (31 points)


This is the first portion of the project. You are asked to explore the dataset using the tools we have learned in class so far.

### Question 1.1 (7 points)
Get started:
* Import the libraries **(2 points)**
* Import the data: both the training set and the test set **(2 points)**
* Take a quick look at the dataset. Are there any missing values? **(3 points)**

In [None]:
# Import the libraries, expand the list as needed
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

#### Import the training set **(1 point)**
Dataset file:
* Train: https://drive.google.com/file/d/1A7kgIjEruZv3qWbNl5kWzP5xRQbZmou5/view?usp=sharing

In [None]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Read in the training set
train=pd.read_csv('/content/drive/MyDrive/train.csv')

# print the shapes of the training set
print(train.shape)

# show the head of the training set
train.head()

(371, 7)


Unnamed: 0,id,bone_length,rotting_flesh,hair_length,has_soul,color,type
0,0,0.354512,0.350839,0.465761,0.781142,clear,Ghoul
1,1,0.57556,0.425868,0.531401,0.439899,green,Goblin
2,2,0.467875,0.35433,0.811616,0.791225,black,Ghoul
3,4,0.776652,0.508723,0.636766,0.884464,black,Ghoul
4,5,0.566117,0.875862,0.418594,0.636438,green,Ghost


In [None]:
# Let's check if there are any missing Values
train.isnull().sum()

id               0
bone_length      0
rotting_flesh    0
hair_length      0
has_soul         0
color            0
type             0
dtype: int64

**Answer:** Our dataset does not have any missing or null/NA values. We are good to proceed.

### Question 1.2 **(12 points)**
* What types of data are the features? Which are Quantitative and which are Qualitative? **(2 points)**
* For the qualitative feature(s), perform a one-hot encoding to transform each into separate columns, one per category. Add these new columns to your `DataFrame` as new features. **(6 points)**
* Drop features that you think are either irrelevant or redundant and store features and types into separate variables (for both train and test set), so 3 variables in total (you do not observe targets for the test set). **(4 points)**

#### What types of data are the features? Which are Quantitative and which are Qualitative? **(2 points)**

**Answer:** From our previous exploration, the `color` variable is categorical (qualitative), while `bone_length`, `rotting_flesh`, `hair_length`, and `has_soul` are all quantitative; `id` is only an identifier. Next, let's encode the `color` feature with a one-hot encoding, so that each of its 6 categories becomes its own 0/1 column.
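
As a rough check of the answer above, pandas can split columns by dtype. The sketch below uses a tiny made-up frame (`toy` and its values are illustrative, not rows from `train.csv`) and assumes that numeric columns are quantitative and everything else is qualitative, which holds for this dataset but not in general.

```python
import pandas as pd

# Toy rows in the same layout as train.csv (values are illustrative only)
toy = pd.DataFrame({
    "bone_length": [0.35, 0.58, 0.47],
    "has_soul": [0.78, 0.44, 0.79],
    "color": ["clear", "green", "black"],
    "type": ["Ghoul", "Goblin", "Ghoul"],
})

# The target is not a feature, so set it aside first
features = toy.drop(columns=["type"])

# Numeric columns are treated as quantitative, all others as qualitative
quantitative = features.select_dtypes(include="number").columns.tolist()
qualitative = features.select_dtypes(exclude="number").columns.tolist()

print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
```

On the real data you would call the same two `select_dtypes` lines on `train` after dropping `type`.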

#### For the qualitative feature(s), perform a one-hot encoding to transform each into separate columns, one per category. Add these new columns to your `DataFrame` as new features. **(6 points)**

In [None]:
# Let's do a one-hot encoding for the color variable of the training set
Onehot=pd.get_dummies(train["color"])


In [None]:
# use pd.concat to get augmented DataFrame
train2=pd.concat([train,Onehot],axis=1)

**Answer:** We have finished one-hot encoding the `color` variable.
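
To see what the encoding produces, here is a minimal sketch on a made-up three-row frame (`toy` is hypothetical). It assumes you want 0/1 integers rather than booleans, hence `dtype=int`; recent pandas versions return `True`/`False` by default.

```python
import pandas as pd

# Tiny illustrative frame; the real notebook uses `train`
toy = pd.DataFrame({
    "has_soul": [0.78, 0.44, 0.79],
    "color": ["clear", "green", "black"],
})

# One new column per category; each row has exactly one 1
dummies = pd.get_dummies(toy["color"], dtype=int)

# Attach the new columns next to the original ones
augmented = pd.concat([toy, dummies], axis=1)
print(augmented)
```

The new columns come out in alphabetical order of the category names, which matters later when the test set has to be encoded the same way.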

#### Drop features that you think are either irrelevant or redundant and store features and types into separate variables, so you should have two variables ready: `train_features` and `train_target`. **(4 points)**

In [None]:
# Separate the features from the target variable
train_features=train2.drop(["type"],axis=1)
train_target=train2["type"]
# Keep untouched copies of the original labels for the later binary problems
true_train_target=train_target.copy()
train_ghoul=train_target.copy()
train_goblin=train_target.copy()

Your training features `train_features` should look like this
![image.png](https://i.loli.net/2021/04/29/PjbAqUVOzvcRDX2.png)

In [None]:
train_features

Unnamed: 0,id,bone_length,rotting_flesh,hair_length,has_soul,color,black,blood,blue,clear,green,white
0,0,0.354512,0.350839,0.465761,0.781142,clear,0,0,0,1,0,0
1,1,0.575560,0.425868,0.531401,0.439899,green,0,0,0,0,1,0
2,2,0.467875,0.354330,0.811616,0.791225,black,1,0,0,0,0,0
3,4,0.776652,0.508723,0.636766,0.884464,black,1,0,0,0,0,0
4,5,0.566117,0.875862,0.418594,0.636438,green,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
366,886,0.458132,0.391760,0.660590,0.635689,blue,0,0,1,0,0,0
367,889,0.331936,0.564836,0.539216,0.551471,green,0,0,0,0,1,0
368,890,0.481640,0.501147,0.496446,0.544003,clear,0,0,0,1,0,0
369,896,0.294943,0.771286,0.583503,0.300618,clear,0,0,0,1,0,0


In [None]:
# Let's drop the 'id' and 'color' column as it doesn't help our classification
train_features=train_features.drop(["id","color"],axis=1)

**Answer:**
* We have finished processing our feature variables for the training set.
* We have also stored training targets into a separate variable.

### Question 1.3 **(12 points)**
In this part of the project, let's treat this as separate binary classification problems first. Suppose that now we are only interested in identifying all the `Ghosts`. Then effectively, we can think of the problem as having just two classes: Ghosts and Non-Ghosts.

Repeat this process for each type and we effectively have done the so-called One-vs-all multi-class classification method.


Let's first perform a Logistic Regression for the `Ghost` vs `Other` problem. Follow the steps:
* Transform your target variables into 0s and 1s - 1 for `Ghost`, 0 for `Other` **(4 Points)**
* Run a Logistic Regression **(8 Points)**
    * Pick accuracy as the model evaluation metrics
    * Construct a validation set / method, explain why you pick this validation method.
    * Calculate and output the evaluation metrics on the validation set / method.

#### Transform your target variables into 0s and 1s - 1 for `Ghost`, 0 for `Other` **(4 Points)**

Make a new target variable called `train_target_ghost`, where
* 1: if type = 'Ghost'
* 0: otherwise

**Hint: you can use Selection method in Pandas to achieve this**

Your `train_target_ghost` should look like this:

![image.png](https://i.loli.net/2021/04/29/gAhWN1rVEToxkDX.png)

In [None]:
# Next we need to process the target variables.
# Note that we now only care about whether the creature is a Ghost or not - Binary Classification
# Hint: you can use Selection method in Pandas to achieve this

# Integer-coded copy of all three classes (0 = Ghost, 1 = Ghoul, 2 = Goblin), used later for the multi-class comparison
train_target_copy=train_target.copy()
train_target_copy[train_target_copy=="Ghost"]=0
train_target_copy[train_target_copy=="Ghoul"]=1
train_target_copy[train_target_copy=="Goblin"]=2

# Binary Ghost target, built on a copy so that train_target keeps the original labels
train_target_ghost=train_target.copy()
train_target_ghost[train_target_ghost=="Ghost"]=1
train_target_ghost[train_target_ghost!=1]=0
print(train_target_ghost)
print(train_target_copy)

0      0
1      0
2      0
3      0
4      1
      ..
366    0
367    1
368    0
369    1
370    0
Name: type, Length: 371, dtype: object
0      1
1      2
2      1
3      1
4      0
      ..
366    2
367    0
368    1
369    0
370    1
Name: type, Length: 371, dtype: object




**Answer:** we have transformed targets into 0 and 1 with `Ghost` being the positive case.
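
The hint about the selection method can also be written as a single boolean comparison. This is a small sketch on a made-up Series (`types` is hypothetical), shown as one possible alternative rather than the required approach; it leaves the original labels untouched and yields an integer dtype directly.

```python
import pandas as pd

# Illustrative labels; in the notebook this would be the `type` column
types = pd.Series(["Ghoul", "Goblin", "Ghost", "Ghost"], name="type")

# True where the creature is a Ghost, then True/False -> 1/0
is_ghost = (types == "Ghost").astype(int)

print(is_ghost.tolist())
```

Because nothing is assigned back into `types`, this form does not trigger pandas' chained-assignment warning.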

#### Split the training set into a training set and a validation set using the `train_test_split` method

You should have 4 variables ready:
* X_train
* X_valid
* y_train_ghost
* y_valid_ghost

In [None]:
# split the dataset
X=train_features
y=train_target_ghost
X_train, X_valid, y_train_ghost, y_valid_ghost = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)

X_train.shape, X_valid.shape, y_train_ghost.shape, y_valid_ghost.shape


((259, 10), (112, 10), (259,), (112,))

#### Run a Logistic Regression **(8 Points)**
* Pick accuracy as the model evaluation metrics
* Calculate and output the evaluation metrics on the validation set / method.

In [None]:
# Cast the 0/1 targets from object to int so that scikit-learn treats them as class labels
y_train_ghost=y_train_ghost.astype(int)
y_valid_ghost=y_valid_ghost.astype(int)

# Standardize the features: fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
scaler.fit(X_train)
X_train=scaler.transform(X_train)
X_valid=scaler.transform(X_valid)

# Run a standard logistic regression on the scaled features
reg=LogisticRegression()
reg.fit(X=X_train,y=y_train_ghost)

# Make model predictions on the validation set
ghost_solution=reg.predict(X_valid)
# Keep the predicted probability of the positive class (Ghost)
ghost_proba=reg.predict_proba(X_valid)[:,1]
print(ghost_proba)

# Let's pick accuracy for our metric: the fraction of correct predictions on the validation data
(y_valid_ghost==ghost_solution).mean()

[4.75575827e-03 9.67707003e-01 7.57704058e-02 6.13606985e-02
 8.56029333e-01 5.23171330e-01 2.03288975e-04 9.91053653e-01
 8.78879306e-02 4.30630885e-01 2.51719509e-02 9.40188614e-01
 4.23334913e-01 9.97559015e-01 1.49226339e-01 6.76001235e-01
 6.85703323e-01 9.25192090e-01 5.80376919e-05 5.63565378e-04
 5.17545390e-01 7.69035579e-04 6.42726315e-04 8.39180546e-01
 2.25047825e-01 4.35515561e-03 2.71876252e-01 8.48828203e-04
 1.73368514e-03 2.74516356e-04 3.84449073e-02 9.98217138e-01
 1.67600214e-03 9.98028631e-01 2.89312281e-03 8.72268972e-01
 6.15144264e-01 9.08108721e-06 9.98823097e-01 8.46963533e-03
 4.27342233e-04 2.12981260e-02 8.50851770e-03 2.43793331e-02
 1.04193439e-04 1.86500187e-03 2.05224669e-02 9.67871421e-03
 9.77276725e-03 4.99086610e-01 1.41855322e-01 3.26740966e-01
 3.56448369e-01 2.98564858e-02 9.62886366e-01 6.18884511e-03
 8.60269089e-01 2.37024834e-04 6.14244315e-05 4.69479998e-03
 5.87347487e-01 1.52387356e-01 5.53050971e-02 9.42355055e-03
 9.79961823e-01 1.980162

0.9017857142857143
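
A single 70/30 split gives one accuracy estimate, and that estimate depends on which rows happen to land in the validation set. One alternative validation method is k-fold cross-validation. The sketch below is only an illustration under stated assumptions: the data is synthetic (`X_demo`, `y_demo` are made-up), and the scaler is wrapped in a pipeline so that it is refitted on the training folds only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for Ghost-vs-Other
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Scaling happens inside the pipeline, so each fold scales with its own training rows
model = make_pipeline(StandardScaler(), LogisticRegression())

# 5 stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

With only 371 labelled creatures, averaging over folds tends to give a steadier number than one split, at the cost of fitting the model several times.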

## Project Milestone #2 **(30 points)**


In this part of the project, let's treat this as separate binary classification problems first. Suppose that now we are only interested in identifying all the `Ghosts`. Then effectively, we can think of the problem as having just two classes: Ghosts and Non-Ghosts.

Repeat this process for each type and we effectively have done the so-called One-vs-all multi-class classification method. Let's explore this in Milestone #2

## 2.1 One-vs-all method (30 points)
In this part, finish up the one-vs-all method. (This builds on the Ghost-vs-Other model from Question 1.3; if you have already finished that part, you can reuse your code from there.)


Here is an outline of what you need to do in this part:


1. Split your data into train and validation
2. Run 3 separate Logistic Regression models on the following Binary Classification Problems
  * Ghost-vs-all
  * Ghoul-vs-all
  * Goblin-vs-all
3. For each validation sample
  * Predict probabilities using the 3 Logistic Regression models above (you can use the function model.predict_proba, which will return 2 columns of probabilities, column 0 for Class 0 and column 1 for Class 1)
  * For each sample row, you will need to make a classification given the 3 probabilities you got (classification means: you need to decide whether the creature is a Ghost, Ghoul or Goblin)
4. After you have your predicted classes for all the validation data, calculate the accuracy of the One-vs-all model
  * You will need 2 columns - your predicted class obtained in step 3 above, and the 'correct answers'
  * Compare the 2 columns and calculate your model accuracy
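
The outline above can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the blobs from `make_blobs` stand in for the creature measurements, and names such as `X_demo`, `proba_columns`, and `ova_accuracy` are made up. It is one way to arrange the steps, not the reference solution.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

classes = ["Ghost", "Ghoul", "Goblin"]

# Synthetic 3-class data with string labels
X_demo, y_idx = make_blobs(n_samples=300, centers=3, n_features=4, random_state=1)
y_demo = np.array(classes)[y_idx]

# Step 1: one split, shared by all three binary models
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_va_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_va)

# Step 2: one logistic regression per "this class vs all others" problem
proba_columns = []
for name in classes:
    clf = LogisticRegression()
    clf.fit(X_tr_s, (y_tr == name).astype(int))
    # Step 3a: column 1 of predict_proba is the probability of the positive class
    proba_columns.append(clf.predict_proba(X_va_s)[:, 1])

# Step 3b: per row, pick the class whose model is most confident
proba = np.column_stack(proba_columns)
predicted = np.array(classes)[proba.argmax(axis=1)]

# Step 4: compare predictions with the correct answers
ova_accuracy = (predicted == y_va).mean()
print(ova_accuracy)
```

Using one shared split (or the same `random_state` each time) is what makes it valid to compare the three probability columns row by row.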



In [None]:
# One-vs-all method code
# You can add as many cells as you like
# This is probably the hardest part of the entire project
# There are a lot of intermediate steps
# Check your variables frequently and make sure you are on the right track before moving on
# Your final output of this part is the accuracy (a number) of the One-vs-all model

train_ghoul[train_ghoul=="Ghoul"]=1
train_ghoul[train_ghoul!=1]=0
train_target_ghoul=train_ghoul
print(train_target_ghoul)

0      1
1      0
2      1
3      1
4      0
      ..
366    0
367    0
368    1
369    0
370    1
Name: type, Length: 371, dtype: object


In [None]:
# Same split settings as for the Ghost model (same random_state), so the validation rows line up across models
X=train_features
y=train_target_ghoul
X_train_ghoul, X_valid_ghoul, y_train_ghoul, y_valid_ghoul = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)

# Cast the 0/1 targets from object to int
y_train_ghoul=y_train_ghoul.astype(int)
y_valid_ghoul=y_valid_ghoul.astype(int)

In [None]:
scaler.fit(X_train_ghoul)
X_train_ghoul=scaler.transform(X_train_ghoul)
X_valid_ghoul=scaler.transform(X_valid_ghoul)
reg.fit(X=X_train_ghoul,y=y_train_ghoul)
ghoul_solution=reg.predict(X_valid_ghoul)
print((y_valid_ghoul==ghoul_solution).mean())
ghoul_proba=reg.predict_proba(X_valid_ghoul)
ghoul_proba=ghoul_proba[:,1]
print(ghoul_proba)

0.8125
[0.84548733 0.00259298 0.60657642 0.23120818 0.0621295  0.09506814
 0.97219932 0.00633493 0.05421835 0.03092395 0.34419161 0.0105422
 0.0652891  0.00414131 0.02229177 0.04772003 0.08072072 0.01446386
 0.99155048 0.94524026 0.00562945 0.94214649 0.50977572 0.04041633
 0.55042503 0.59172613 0.15244732 0.80565076 0.86801584 0.91102928
 0.43281683 0.00682913 0.70519983 0.00700618 0.60855164 0.03833867
 0.01686211 0.99698906 0.00187551 0.67153004 0.90617598 0.82187323
 0.69689514 0.21546162 0.9820868  0.88124451 0.25102566 0.80662325
 0.54834897 0.2113489  0.03957897 0.20828146 0.04441231 0.41260542
 0.00138228 0.71183752 0.08068201 0.95929078 0.95842534 0.50849653
 0.02214994 0.149796   0.4369284  0.48056954 0.00116257 0.86375838
 0.52802844 0.13685508 0.10558992 0.99306026 0.50650165 0.78966347
 0.15628474 0.95528805 0.01841923 0.27361967 0.65466306 0.61938029
 0.89667748 0.0935496  0.54189381 0.37273196 0.00586103 0.00588071
 0.1241764  0.01159283 0.06809427 0.06875112 0.00293991 

In [None]:
train_goblin[train_goblin=="Goblin"]=1
train_goblin[train_goblin!=1]=0
train_target_goblin=train_goblin
print(train_target_goblin)

0      0
1      1
2      0
3      0
4      0
      ..
366    1
367    0
368    0
369    0
370    0
Name: type, Length: 371, dtype: object


In [None]:
# Same split settings again, so the Goblin validation rows match the Ghost and Ghoul ones
X=train_features
y=train_target_goblin
X_train_goblin, X_valid_goblin, y_train_goblin, y_valid_goblin = train_test_split(X, y, test_size=0.3,random_state=1, shuffle=True)

# Cast the 0/1 targets from object to int
y_train_goblin=y_train_goblin.astype(int)
y_valid_goblin=y_valid_goblin.astype(int)

In [None]:
scaler.fit(X_train_goblin)
X_train_goblin=scaler.transform(X_train_goblin)
X_valid_goblin=scaler.transform(X_valid_goblin)
reg.fit(X=X_train_goblin,y=y_train_goblin)
goblin_solution=reg.predict(X_valid_goblin)
print((y_valid_goblin==goblin_solution).mean())
goblin_proba=reg.predict_proba(X_valid_goblin)
goblin_proba=goblin_proba[:,1]
print(goblin_proba)

0.7142857142857143
[0.27690485 0.29411303 0.15154165 0.37211482 0.20202387 0.16628269
 0.26194085 0.16353506 0.43619443 0.43150404 0.386512   0.20283762
 0.32324967 0.13626835 0.57716382 0.27408166 0.28717299 0.16772484
 0.20057396 0.23366954 0.54001792 0.22920802 0.727072   0.20924515
 0.05940595 0.43875567 0.23637019 0.42931074 0.26265015 0.40648545
 0.29852223 0.09523222 0.46052459 0.07552441 0.44057506 0.19575953
 0.45039369 0.30983408 0.14448498 0.27061568 0.35999133 0.10988908
 0.16945998 0.38097159 0.28444137 0.34012514 0.29837916 0.18298319
 0.40268398 0.13478518 0.57305924 0.25095503 0.42453159 0.30364665
 0.57275077 0.3907087  0.12199015 0.30240979 0.47373338 0.42799011
 0.35780093 0.43855055 0.22746992 0.41422653 0.53819259 0.27231781
 0.3826697  0.68069104 0.63010285 0.28614768 0.49732598 0.45595101
 0.59700449 0.38270811 0.24349787 0.1845523  0.38224255 0.4031532
 0.38052543 0.1696973  0.29065393 0.43703846 0.53560546 0.31781587
 0.33535158 0.46896391 0.40698702 0.29823162

In [None]:
X=train_features
y=train_target_copy
X_train_copy, X_valid_copy, y_train_copy, y_valid_copy = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)

X_train_copy.shape, X_valid_copy.shape, y_train_copy.shape, y_valid_copy.shape

((259, 10), (112, 10), (259,), (112,))

In [None]:
# Stack the three positive-class probability vectors side by side
# Column order matches the integer coding of train_target_copy: 0 = Ghost, 1 = Ghoul, 2 = Goblin
s=np.column_stack((ghost_proba,ghoul_proba,goblin_proba))

# For each validation row, predict the class with the highest probability
target=s.argmax(axis=1).tolist()
print(target)

# Compare with the true classes from the same split (same random_state, so the rows line up)
print(list(y_valid_copy))
(y_valid_copy==target).mean()

[1, 0, 1, 2, 0, 0, 1, 0, 2, 2, 2, 0, 0, 0, 2, 0, 0, 0, 1, 1, 2, 1, 2, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 0, 2, 0, 2, 1, 0, 1, 0, 1, 1, 1, 0, 2, 1, 1, 0, 1, 1, 2, 2, 1, 1, 1, 2, 1, 0, 1, 1, 1, 1, 0, 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 2, 2, 2, 0, 2, 2, 1, 2, 1, 0, 2, 1, 1]
[2, 0, 2, 1, 0, 2, 1, 0, 1, 2, 1, 0, 0, 0, 2, 0, 0, 0, 1, 1, 2, 1, 2, 0, 0, 2, 0, 1, 1, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 0, 0, 2, 2, 0, 1, 0, 1, 1, 2, 0, 2, 2, 1, 0, 1, 1, 2, 2, 2, 1, 2, 2, 1, 0, 2, 1, 1, 1, 0, 2, 1, 0, 0, 2, 0, 2, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 2, 2, 2, 0, 1, 2, 2, 2, 1, 2, 0, 1, 1]


0.75
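
Accuracy alone does not show which creatures get confused with which. `confusion_matrix`, which is already imported at the top of the notebook, breaks the result down by class. The sketch below uses short made-up label lists (`y_true_demo`, `y_pred_demo`), not the validation output above, with the same coding 0 = Ghost, 1 = Ghoul, 2 = Goblin.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only (0 = Ghost, 1 = Ghoul, 2 = Goblin)
y_true_demo = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred_demo = [0, 0, 1, 2, 2, 1, 2, 1]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])
print(cm)

# Overall accuracy is the diagonal divided by the total count
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(demo_accuracy)

# Per-class recall: share of each true class that was predicted correctly
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(recall_per_class)
```

On the real validation set you would pass `list(y_valid_copy)` and `target` instead of the demo lists.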

---
## Project Milestone #3 **(20 points)**


This is our final Milestone. Here is an outline of what you need to do:


1. (Optional) You can try building a different model, say a Neural Network
2. Compare your models' performance on the validation set, and pick the best performing model.
3. Use your best model to make final prediction on the test set and submit the submission file



## (Optional) Building Additional Models

## (Optional) Comparing Different Model Performance
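
One possible way to compare models is to fit each candidate on the same training split and score it on the same validation split. The sketch below is a hedged illustration: the data comes from `make_classification`, the neural network is scikit-learn's `MLPClassifier` with an arbitrary small hidden layer, and the names `candidates` and `results` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the creature features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Each candidate gets the same preprocessing and the same split
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=1),
    ),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # .score returns accuracy for classifiers
    results[name] = model.score(X_va, y_va)

best_name = max(results, key=results.get)
print(results)
print("Best on validation:", best_name)
```

Whichever model wins here says nothing about the real dataset; the point is only the comparison pattern.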

## Final submission (20 points)
Whoo-hoo! Finally, we are ready to submit our predictions on the test set!

Here is an outline of what you need to do in this part:
1. Compare the performance of all your models and pick the best one
2. Train your best model one last time on the entire training+validation set using optimal settings
3. Load the test.csv file: https://drive.google.com/file/d/1UhYjXWwH5L4BzUPWB9Iq5wO05_2SyMdm/view?usp=sharing
4. Make your final predictions on the test set
  * Make sure you apply to the test set all the transformations you applied to the training set:
    * One-hot-encoding
    * Selecting the same variables
    * etc
5. Your final output is a Pandas Dataframe
  * It should have only 2 columns: id, type
  * For each row in the test set
    * Keep the id and your prediction in the type column
  * Save your final Dataframe into a csv file
    * You can google pandas.DataFrame.to_csv and use that function
  * Make sure your submission file looks the same as this sample file: https://drive.google.com/file/d/16sjX5ofIbmqU3FbK5cRiVpwRwLppMQxg/view?usp=sharing
  * Rename your submission file to "yourname_submission.csv"
6. Upload both your Notebook and submission file onto the Google Classroom
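
Steps 4 and 5 can be sketched on toy data. The frames and predictions below are made up (`train_toy`, `test_toy`, `predictions`), and the sketch writes to an in-memory buffer instead of a file. It assumes that the test features must have exactly the training columns in the same order, which `reindex` enforces.

```python
import io
import pandas as pd

train_toy = pd.DataFrame({
    "id": [0, 1, 2],
    "has_soul": [0.78, 0.44, 0.79],
    "color": ["clear", "green", "black"],
})
test_toy = pd.DataFrame({
    "id": [3, 6],
    "has_soul": [0.70, 0.45],
    "color": ["black", "white"],
})

def encode(frame):
    # One-hot encode color, then drop the columns that are not features
    dummies = pd.get_dummies(frame["color"], dtype=int)
    return pd.concat([frame.drop(columns=["id", "color"]), dummies], axis=1)

train_X = encode(train_toy)
# Same columns, same order as training: missing categories become 0,
# categories never seen in training (here "white") are dropped
test_X = encode(test_toy).reindex(columns=train_X.columns, fill_value=0)

# Placeholder predictions; a fitted model would produce these from test_X
predictions = ["Ghoul", "Ghost"]

# Exactly two columns, id and type, and no index column in the file
submission = pd.DataFrame({"id": test_toy["id"], "type": predictions})
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

For the real file, replace the buffer with a path such as `"yourname_submission.csv"`.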

In [None]:
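# Final submission
# Scale all training rows with the scaler fitted earlier
X=train_features
X=scaler.transform(X)
# One fresh copy of the original labels per binary model (Ghost, Ghoul, Goblin)
y1=true_train_target.copy()
y2=true_train_target.copy()
y3=true_train_target.copy()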

In [None]:
y1[y1=="Ghost"]=1
y1[y1!=1]=0
y2[y2=="Ghoul"]=1
y2[y2!=1]=0
y3[y3=="Goblin"]=1
y3[y3!=1]=0

In [None]:
y1=y1.astype(int, copy=True, errors='raise')
y2=y2.astype(int, copy=True, errors='raise')
y3=y3.astype(int, copy=True, errors='raise')

In [None]:
test=pd.read_csv("/content/drive/MyDrive/test.csv")
color=pd.get_dummies(test["color"])
test2=pd.concat([test,color],axis=1)
test2=test2.drop(["id","color"],axis=1)

In [None]:
test2.head()

Unnamed: 0,bone_length,rotting_flesh,hair_length,has_soul,black,blood,blue,clear,green,white
0,0.471774,0.387937,0.706087,0.698537,1,0,0,0,0,0
1,0.427332,0.645024,0.565558,0.451462,0,0,0,0,0,1
2,0.549602,0.491931,0.660387,0.449809,1,0,0,0,0,0
3,0.638095,0.682867,0.471409,0.356924,0,0,0,0,0,1
4,0.361762,0.583997,0.377256,0.276364,1,0,0,0,0,0


In [None]:
test2=scaler.transform(test2)

In [None]:
reg1=LogisticRegression()
reg2=LogisticRegression()
reg3=LogisticRegression()
reg1.fit(X,y1)
reg2.fit(X,y2)
reg3.fit(X,y3)
guiproba=reg1.predict_proba(test2)
shishiguiproba=reg2.predict_proba(test2)
gebulinproba=reg3.predict_proba(test2)
print(guiproba)
print(shishiguiproba)
print(gebulinproba)

[[9.99809193e-01 1.90807354e-04]
 [8.17287034e-01 1.82712966e-01]
 [9.93729010e-01 6.27099001e-03]
 ...
 [9.99753960e-01 2.46039850e-04]
 [9.08931555e-04 9.99091068e-01]
 [2.78780729e-02 9.72121927e-01]]
[[0.10060261 0.89939739]
 [0.72259382 0.27740618]
 [0.36430928 0.63569072]
 ...
 [0.07344319 0.92655681]
 [0.99669302 0.00330698]
 [0.98054418 0.01945582]]
[[0.50811304 0.49188696]
 [0.82121204 0.17878796]
 [0.65590372 0.34409628]
 ...
 [0.69208079 0.30791921]
 [0.87738628 0.12261372]
 [0.84018094 0.15981906]]


In [None]:
guiproba=guiproba[:,1]
shishiguiproba=shishiguiproba[:,1]
gebulinproba=gebulinproba[:,1]

In [None]:
# Stack the three positive-class probability vectors into one array with a row for every test creature
# Column order: 0 = Ghost, 1 = Ghoul, 2 = Goblin
yss=np.column_stack((guiproba,shishiguiproba,gebulinproba))

# Predicted class index = column with the highest probability
target1=yss.argmax(axis=1).tolist()
print(target1[:10])

[1, 1, 1, 0, 0, 0, 1, 1, 2, 1]


In [None]:
# Map the class indices back to creature names
labels=["Ghost","Ghoul","Goblin"]
last_solution=[labels[i] for i in target1]
print(last_solution[:10])

['Ghoul', 'Ghoul', 'Ghoul', 'Ghost', 'Ghost', 'Ghost', 'Ghoul', 'Ghoul', 'Goblin', 'Ghoul']

In [None]:
# Build the two-column submission (id, type) and save it without the index column
solu=pd.DataFrame({"id":test["id"],"type":last_solution})
solu.to_csv("姚顺顺_submission.csv",index=False)

# Congratulations on finishing your first big project!!!