# Overview

---

This notebook shows how to apply one-hot encoding in Python to NumPy arrays and pandas DataFrames.

# Import libraries and dataset

In [16]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [8]:
# Enable downloading of public Google Drive files
!pip install --upgrade --no-cache-dir gdown==4.5.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gdown==4.5.1
  Downloading gdown-4.5.1.tar.gz (14 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: gdown
  Building wheel for gdown (PEP 517) ... done
  Created wheel for gdown: filename=gdown-4.5.1-py3-none-any.whl size=14951 sha256=62dc6dadcf9878c89c5ec8699ccfd35243c9953f9885122b08093afd71801e85
  Stored in directory: /tmp/pip-ephem-wheel-cache-s2gfiwa9/wheels/3d/ec/b0/a96d1d126183f98570a785e6bf8789fca559853a9260e928e1
Successfully built gdown
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.4.0
    Uninstalling gdown-4.4.0:
      Successfully uninstalled gdown-4.4.0
Successfully installed gdown-4.5.1


In [9]:
# Download the dataset
!gdown 1MnfPTCIApAvdjNHxKwomUd0x7JD00rbI

Downloading...
From: https://drive.google.com/uc?id=1MnfPTCIApAvdjNHxKwomUd0x7JD00rbI
To: /content/startups.csv
100% 2.44k/2.44k [00:00<00:00, 7.11MB/s]


In [10]:
dataset_pandas = pd.read_csv('startups.csv')
dataset_numpy = dataset_pandas.to_numpy()

# About the dataset

---

This dataset describes 50 startups: how much each one spent on research and development (R&D), administration, and marketing, the state where it is located, and the profit it made.

The task is to predict a startup's profit from the first four columns.

The first three columns are numeric and can be passed to a model as they are, but the state is a string. Before an algorithm can use it, it has to be encoded as numbers. Because the states have no natural order, one-hot encoding is a good fit: each state gets its own 0/1 column.
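To make the idea concrete before bringing in scikit-learn, here is a minimal sketch of one-hot encoding done by hand with NumPy. The `states` array is a small made-up sample, not the real column, and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical sample of the State column (not the full dataset)
states = np.array(["New York", "California", "Florida", "New York"])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the index of each original entry within that sorted list
categories, codes = np.unique(states, return_inverse=True)

# Picking rows of an identity matrix gives one 0/1 column per category
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row contains a single 1.0, placed in the column that corresponds to that row's state. `OneHotEncoder` produces the same kind of matrix and also remembers the categories so that new data can be encoded consistently.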

In [11]:
dataset_pandas

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
5,131876.9,99814.71,362861.36,New York,156991.12
6,134615.46,147198.87,127716.82,California,156122.51
7,130298.13,145530.06,323876.68,Florida,155752.6
8,120542.52,148718.95,311613.29,New York,152211.77
9,123334.88,108679.17,304981.62,California,149759.96


# Encoding a Pandas dataframe

In [20]:
# [3] is the zero-based position of the column to encode (State)
transformer = ColumnTransformer(
    transformers=[('OneHot', OneHotEncoder(), [3])],
    remainder="passthrough",  # keep the other columns unchanged
)
encoded_df = pd.DataFrame(transformer.fit_transform(dataset_pandas))

The one-hot columns now occupy the first three positions, one per state in alphabetical order (California, Florida, New York), followed by the untouched numeric columns.

In [21]:
encoded_df

Unnamed: 0,0,1,2,3,4,5,6
0,0.0,0.0,1.0,165349.2,136897.8,471784.1,192261.83
1,1.0,0.0,0.0,162597.7,151377.59,443898.53,191792.06
2,0.0,1.0,0.0,153441.51,101145.55,407934.54,191050.39
3,0.0,0.0,1.0,144372.41,118671.85,383199.62,182901.99
4,0.0,1.0,0.0,142107.34,91391.77,366168.42,166187.94
5,0.0,0.0,1.0,131876.9,99814.71,362861.36,156991.12
6,1.0,0.0,0.0,134615.46,147198.87,127716.82,156122.51
7,0.0,1.0,0.0,130298.13,145530.06,323876.68,155752.6
8,0.0,0.0,1.0,120542.52,148718.95,311613.29,152211.77
9,1.0,0.0,0.0,123334.88,108679.17,304981.62,149759.96


# Encoding a numpy array

In [22]:
# [3] is the zero-based position of the column to encode (State)
transformer = ColumnTransformer(
    transformers=[('OneHot', OneHotEncoder(), [3])],
    remainder="passthrough",  # keep the other columns unchanged
)
encoded_df = np.array(transformer.fit_transform(dataset_numpy))

As with the DataFrame, the one-hot columns occupy the first three positions, one per state, followed by the remaining columns.
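One detail worth checking with NumPy input: an array that mixes numbers and strings is stored with `dtype=object`, and the pass-through columns can carry that dtype into the encoded result. The sketch below, which uses a three-row stand-in array rather than the real `dataset_numpy`, shows one way to convert the result to floats once the string column has been encoded. The `sparse_threshold=0` argument is an addition for this example to force dense output.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Stand-in for dataset_numpy: mixed numbers and strings, hence dtype=object
sample = np.array([
    [165349.2, 136897.8, 471784.1, "New York", 192261.83],
    [162597.7, 151377.59, 443898.53, "California", 191792.06],
    [153441.51, 101145.55, 407934.54, "Florida", 191050.39],
], dtype=object)

transformer = ColumnTransformer(
    transformers=[("OneHot", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)
encoded = np.asarray(transformer.fit_transform(sample))
print(encoded.dtype)

# Every remaining value is numeric, so the array can be cast to float64
encoded_float = encoded.astype(np.float64)
print(encoded_float.dtype)
```

A float array is usually what downstream estimators and NumPy math routines expect, so the cast is a cheap safeguard before training.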

In [23]:
encoded_df

array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1, 192261.83],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53, 191792.06],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54, 191050.39],
       [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62, 182901.99],
       [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42, 166187.94],
       [0.0, 0.0, 1.0, 131876.9, 99814.71, 362861.36, 156991.12],
       [1.0, 0.0, 0.0, 134615.46, 147198.87, 127716.82, 156122.51],
       [0.0, 1.0, 0.0, 130298.13, 145530.06, 323876.68, 155752.6],
       [0.0, 0.0, 1.0, 120542.52, 148718.95, 311613.29, 152211.77],
       [1.0, 0.0, 0.0, 123334.88, 108679.17, 304981.62, 149759.96],
       [0.0, 1.0, 0.0, 101913.08, 110594.11, 229160.95, 146121.95],
       [1.0, 0.0, 0.0, 100671.96, 91790.61, 249744.55, 144259.4],
       [0.0, 1.0, 0.0, 93863.75, 127320.38, 249839.44, 141585.52],
       [1.0, 0.0, 0.0, 91992.39, 135495.07, 252664.93, 134307.35],
       [0.0, 1.0, 0.0, 119943.24, 156547.42, 256512.92, 1326