## Introduction

Your client wants to better understand unicorns (privately held companies valued at over $1 billion) in the hope of becoming an early investor in future highly successful companies. They are particularly interested in the investment strategies of the three top unicorn investors: Sequoia Capital, Tiger Global Management, and Accel.


In [3]:
# Import libraries and packages.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# Run this cell so pandas displays all columns
pd.set_option('display.max_columns', None)

In [5]:
# Load the dataset into a DataFrame.
companies = pd.read_csv('Modified_Unicorn_Companies.csv')

# Display the first five rows.
companies.head()


|   | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bytedance | 180 | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S... |
| 1 | SpaceX | 100 | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen... |
| 2 | SHEIN | 100 | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China... |
| 3 | Stripe | 95 | 2014-01-23 | FinTech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG |
| 4 | Klarna | 46 | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita... |


In [None]:
# Display the data types of the columns.
companies.dtypes

In [None]:
# Apply necessary datatype conversions.
companies['Date Joined'] = pd.to_datetime(companies['Date Joined'])

In [None]:
# Create the column Years To Unicorn.
companies['Years To Unicorn'] = companies['Date Joined'].dt.year - companies['Year Founded']

In [None]:
companies['Years To Unicorn'].describe()

In [None]:
# Isolate any rows where `Years To Unicorn` is negative
companies[companies['Years To Unicorn'] < 0]

In [None]:
# Replace InVision's `Year Founded` value with 2011
companies.loc[companies['Company']=='InVision', 'Year Founded'] = 2011

# Verify the change was made properly
companies[companies['Company']=='InVision']

In [None]:
# Recalculate all values in the `Years To Unicorn` column
companies['Years To Unicorn'] = companies['Date Joined'].dt.year - companies['Year Founded']

# Verify that there are no more negative values in the column
companies['Years To Unicorn'].describe()

In [None]:
# List provided by the company of the expected industry labels in the data
industry_list = [
    'Artificial intelligence', 'Other', 'E-commerce & direct-to-consumer', 'Fintech',
    'Internet software & services', 'Supply chain, logistics, & delivery', 'Consumer & retail',
    'Data management & analytics', 'Edtech', 'Health', 'Hardware', 'Auto & transportation',
    'Travel', 'Cybersecurity', 'Mobile & telecommunications',
]

In [None]:
# Check which values are in `Industry` but not in `industry_list`
set(companies['Industry']) - set(industry_list)

In [None]:
# 1. Create `replacement_dict`
replacement_dict = {'Artificial Intelligence': 'Artificial intelligence',
                   'Data management and analytics': 'Data management & analytics',
                   'FinTech': 'Fintech'
                   }

# 2. Replace the incorrect values in the `Industry` column
companies['Industry'] = companies['Industry'].replace(replacement_dict)

# 3. Verify that there are no longer any elements in `Industry` that are not in `industry_list`
set(companies['Industry']) - set(industry_list)

In [None]:
# Isolate rows of all companies that have duplicates
companies[companies.duplicated(subset=['Company'], keep=False)]

In [None]:
# Drop rows of duplicate companies after their first occurrence
companies = companies.drop_duplicates(subset=['Company'], keep='first')

In [None]:
# Create new `High Valuation` column
# Use qcut to divide Valuation into 'high' and 'low' Valuation groups
companies['High Valuation'] = pd.qcut(companies['Valuation'], 2, labels = ['low', 'high'])
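
To illustrate what `pd.qcut` does with two bins, here is a minimal sketch on made-up valuations (the numbers and the names `toy_valuations` and `toy_groups` are illustrative assumptions, not taken from the dataset). With `q=2` the split falls at the median, so values at or below the median land in the first group.

```python
import pandas as pd

# Hypothetical valuations (in $B), for illustration only.
toy_valuations = pd.Series([1, 2, 3, 4, 10, 50])

# q=2 splits at the median (3.5 here); values at or below it are labeled 'low'.
toy_groups = pd.qcut(toy_valuations, 2, labels=['low', 'high'])

print(toy_groups.tolist())
# ['low', 'low', 'low', 'high', 'high', 'high']
```

When many rows share the median value, the two groups can end up unequal in size, so it is worth checking the result with `value_counts()`.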

In [None]:
# Rank the continents by number of unicorn companies
companies['Continent'].value_counts()

In [None]:
# Create numeric `Continent Number` column
continent_dict = {'North America': 1,
                  'Asia': 2,
                  'Europe': 3,
                  'South America': 4,
                  'Oceania': 5,
                  'Africa': 6
                 }
# `map` returns the integer for each continent (NaN for any value missing from the dict)
companies['Continent Number'] = companies['Continent'].map(continent_dict)
companies.head()

In [None]:
# Create `Country/Region Numeric` column
# Create numeric categories for Country/Region
companies['Country/Region Numeric'] = companies['Country/Region'].astype('category').cat.codes

In [None]:
# Convert `Industry` to numeric data
# Create dummy variables with Industry values
industry_encoded = pd.get_dummies(companies['Industry'])

# Combine `companies` DataFrame with new dummy Industry columns
companies = pd.concat([companies, industry_encoded], axis=1)

In [None]:
companies.head()

**Question: Which categorical encoding approach did you use for each variable? Why?**

* `Continent` - Ordinal label encoding was used because the categories were given a hierarchical order (continents ranked by number of unicorn companies).
* `Country/Region` - Nominal label encoding was used because there was no hierarchical order to the categories.
* `Industry` - Dummy encoding was used because there were not many different categories represented and they were all equally important.

**Question: How does label encoding change the data?**

Label encoding changes the data by replacing each category's qualitative value with a unique number.
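
As a minimal sketch of that assignment, the example below label-encodes a small made-up column with `cat.codes` and keeps a lookup table for decoding. The DataFrame `toy` and the name `code_to_label` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical country values, for illustration only.
toy = pd.DataFrame({'Country/Region': ['Sweden', 'China', 'United States', 'China']})

# Convert to a categorical; codes follow the sorted order of the categories.
as_category = toy['Country/Region'].astype('category')
toy['Country/Region Numeric'] = as_category.cat.codes

# Keep a lookup table so the numeric codes can be decoded later.
code_to_label = dict(enumerate(as_category.cat.categories))

print(toy['Country/Region Numeric'].tolist())  # [1, 0, 2, 0]
print(code_to_label)  # {0: 'China', 1: 'Sweden', 2: 'United States'}
```

Storing the lookup table alongside the encoded column makes it possible to translate a code back to its original label.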

**Question: What are the benefits of label encoding?**

Label encoding is useful for machine learning because many types of models require all variables to be of a numeric data type.

**Question: What are the disadvantages of label encoding?**

Label encoding may make it more difficult to directly interpret what a column value represents. Further, it may introduce unintended relationships between categories, such as an implied order or distance that does not exist in the data.
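
The second disadvantage can be sketched with a small made-up example (the series `toy_countries` and the gap variables are assumptions for illustration). Label codes make some categories look numerically "closer" than others, while dummy/one-hot columns treat every pair of categories the same way.

```python
import pandas as pd

# Hypothetical nominal categories, for illustration only.
toy_countries = pd.Series(['China', 'Sweden', 'United States'])

# Label encoding: codes 0, 1, 2 follow sorted order.
codes = toy_countries.astype('category').cat.codes

# Dummy/one-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(toy_countries, dtype=int)

# With label codes, the numeric gap differs from pair to pair.
gap_china_sweden = abs(int(codes.iloc[0]) - int(codes.iloc[1]))
gap_china_us = abs(int(codes.iloc[0]) - int(codes.iloc[2]))

# With one-hot rows, every pair differs in the same number of columns.
diff_china_sweden = int((one_hot.iloc[0] != one_hot.iloc[1]).sum())
diff_china_us = int((one_hot.iloc[0] != one_hot.iloc[2]).sum())

print(gap_china_sweden, gap_china_us)    # 1 2
print(diff_china_sweden, diff_china_us)  # 2 2
```

A model that reads the label codes as magnitudes could treat the gap of 2 as meaningful, even though the countries have no inherent order.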

## Conclusion

**What are some key takeaways that you learned during this lab?**

* Input validation is essential for ensuring data is high quality and error-free.
* In practice, input validation requires trial and error to identify issues and determine the best way to fix them.
* There are benefits and disadvantages to both label encoding and dummy/one-hot encoding.
* The decision to use label encoding versus dummy/one-hot encoding needs to be made on a case-by-case basis.
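
The validation checks from this lab could be gathered into one reusable report, as in the sketch below. The function `validation_report`, its return keys, and the toy data are hypothetical and introduced only for illustration. The column names match the dataset used above.

```python
import pandas as pd

def validation_report(df, expected_industries):
    """Summarize three checks: negative durations, unexpected labels, duplicates."""
    years = pd.to_datetime(df['Date Joined']).dt.year - df['Year Founded']
    duplicated = df.duplicated(subset=['Company'], keep=False)
    return {
        # Companies that appear to have become unicorns before being founded
        'negative_years': df.loc[years < 0, 'Company'].tolist(),
        # Industry labels that are not in the expected list
        'unexpected_industries': sorted(set(df['Industry']) - set(expected_industries)),
        # Company names that occur more than once
        'duplicate_companies': sorted(df.loc[duplicated, 'Company'].unique().tolist()),
    }

# Hypothetical rows that trigger each check, for illustration only.
toy = pd.DataFrame({
    'Company': ['A', 'B', 'B'],
    'Date Joined': ['2017-04-07', '2011-12-12', '2011-12-12'],
    'Year Founded': [2012, 2015, 2015],
    'Industry': ['Fintech', 'FinTech', 'FinTech'],
})

report = validation_report(toy, ['Fintech'])
print(report)
```

A report like this only flags issues. Deciding how to fix each one (correcting a year, remapping a label, dropping a duplicate) still needs case-by-case judgment.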

