
# Midterm project submission
**Due April 5th (Friday) before class.** Counts for 1/3 of the overall project grade (10% of the final grade).

You should address all the questions relevant to your project.
You will not be graded based on the values of the model performance, but on whether or not you have applied the right methodology: formulated the business model, translated it into the right machine learning approach, analyzed your data, prepared it for modeling, applied at least two different machine learning algorithms, used cross-validation for model tuning, justified your tuning metric, set up the proper machine learning pipeline without data leakage, evaluated your model using all the relevant metrics, and justified all your decisions.

If you have tried different approaches, please include them all, and not just the best one.
If doing some feature engineering or feature selection has improved your model, also please include all of the steps, not just the most successful ones.

You should submit the notebook with the code, output and explanations. The notebook should be executable and comprehensible.

Points will be deducted for the following reasons:
- data leakage
- unjustified decisions (no discussion on: choice of metric for optimization, choice of the number of components of PCA if used, or blind removal of outliers...)
- notebook not comprehensible
- notebook with incomplete output
- notebook not executable
- blind copy-pasting from ChatGPT, if the copied code is not suitable for the task
- writing your own code (or copy-pasting it from an outside source) for simple functions that we covered and that already exist in `sklearn` (train-test split, plain grid search, encoding of categorical variables, ...), as this leads to:
    - convoluted code prone to bugs
    - code that is hard to understand and review
    - a waste of a data scientist's time if ready-made simple functions exist
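
The requirements above (ready-made `sklearn` functions, cross-validation for tuning, and a pipeline without data leakage) can be sketched as follows. This is only a minimal illustration under stated assumptions: the data is synthetic, and logistic regression, the parameter grid, and F1 as the tuning metric are placeholder choices, not recommendations for any particular project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (an assumption; replace with your own X, y)
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0)

# Hold out the test set first so that nothing below ever sees it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline, so it is re-fit on the training folds only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Plain grid search with 5-fold cross-validation; F1 is an example tuning metric
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated F1:", round(search.best_score_, 3))
print("Held-out test F1:", round(search.score(X_test, y_test), 3))
```

Because the scaler is a pipeline step, its mean and variance are never computed on validation or test rows, which is the kind of leakage the grading criteria refer to.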

Additional points will be awarded for trying and testing different relevant approaches, from exploratory data analysis, to feature engineering, to modeling and evaluation.

There should be one submission per group, but team member evaluation can be submitted per person. If not submitted, the default is that all the team members have contributed equally to the project and should get the same grade.

Here, fill out group number, student IDs, and project name.
### Group number:
### Student IDs:
### Project name:

## What business problem are you solving?
- Please state clearly what business problem you are solving. (one sentence)
- Elaborate on why this is a relevant problem, and what you can do with the model output to create business value, i.e., how the model output is actionable. (2-3 paragraphs)

## What is the machine learning problem that you are solving?
- Please state clearly what the ML problem is. Is it a classification problem or a regression problem? Is the goal to build a model that generates a ranked list, or is it to detect anomalies as new data come in? Are you doing clustering to find hidden patterns? (one sentence)
- If applicable state your target.

## Data exploration and preparation

- How many data instances do you have?
- Do you have duplicates?
- How many features? What type are they?
- If they are categorical, what categories do they have, and what is their frequency?
- If they are numerical, what is their distribution?
- What is the distribution of the target variable?
- If you have a target, you can also check the relationship between the target and the variables.
- Do you have missing data? If yes, how are you going to handle it?
- Can you use the features in their original form, or do you need to alter them in some way?
- What have you learned about your data? Is there anything that can help you in feature engineering or modeling?

If you have a lot of features, for midterm submission, you can choose to use only a subset. The same goes for data, if you use a reasonable subset of data for the problem at hand.
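
Most of the questions above map onto one-line `pandas` calls. The sketch below runs them on a small made-up DataFrame; the column names and values are assumptions for illustration and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset (all names and values are made up)
toy = pd.DataFrame({
    "name": ["A", "B", "B", "C", "D"],
    "market": ["Software", "Games", "Games", None, "Software"],
    "funding_total_usd": [1_000_000.0, 250_000.0, 250_000.0, np.nan, 40_000.0],
    "status": ["operating", "closed", "closed", "operating", "acquired"],
})

n_rows, n_cols = toy.shape                  # instances and features
n_duplicates = toy.duplicated().sum()       # fully identical rows
missing_per_column = toy.isna().sum()       # missing values per column

print(f"{n_rows} rows, {n_cols} columns, {n_duplicates} duplicated row(s)")
print(toy.dtypes)                                   # feature types
print(toy["market"].value_counts(dropna=False))     # category frequencies
print(toy["funding_total_usd"].describe())          # numeric distribution
print(toy["status"].value_counts(normalize=True))   # target distribution
print(missing_per_column)

# Relationship between the target and a numeric feature
print(toy.groupby("status")["funding_total_usd"].median())
```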

In [None]:
import pandas as pd
import numpy as np

### Data Loading and inspection

Let's collect all variables here.

- df: original, uncleaned "investments_VC.csv" dataset
- df_clean: XXX



In [None]:
# Investments
df = pd.read_csv("investments_VC.csv", encoding='ISO-8859-1')

In [None]:
# Inspect data set
print("Investments shape is: ", df.shape)

# The last row with data, inspected manually, is 49,439, so a lot of rows are read in as NaN

Investments shape is:  (12547, 39)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df.head(5)

In [None]:
df.info()

# funding_total_usd is read as object instead of float/int; the date columns also do not have the right data type

In [None]:
duplicates_count = df['permalink'].duplicated(keep=False).sum()
print(f"Number of duplicated entries in 'permalink' column: {duplicates_count}")
# The high number of duplicates comes from many empty rows, which we need to drop

### Analysing target column "Status"

In [None]:
# Target Column is Status, therefore we inspect it

status_counts = df['status'].value_counts(dropna=False)

# NaN or any other value not matching 'acquired', 'closed', or 'operating' is counted in `check`
check = len(df) - status_counts.get('acquired', 0) - status_counts.get('closed', 0) - status_counts.get('operating', 0)

missing_values = df["status"].isna().sum()

other = check - missing_values

# Display the counts
print(f"There are {status_counts.get('acquired', 0)} acquired companies")
print(f"There are {status_counts.get('closed', 0)} closed companies")
print(f"There are {status_counts.get('operating', 0)} operating companies")
print(f"There are {missing_values} NaN values")
print(f"There are {other} other values")

# Data Cleaning

### Deleting unnecessary columns

In [None]:
# To reduce unnecessary and redundant data, we delete columns that are not relevant for model development

#df.drop(["permalink", "homepage_url"], axis=1, inplace=True)

### Cleaning columns

In [None]:
# Clean numbers in funding_total_usd column

# Ensure column names are trimmed
df.columns = df.columns.str.strip()

# Directly convert 'funding_total_usd' to a float, thereby also transforming "-" to NaN
df['funding_total_usd'] = pd.to_numeric(df['funding_total_usd'].str.replace(',', ''), errors='coerce')

df["funding_total_usd"].dtype

df.head(10)


In [None]:
# Fix column names (a no-op if the names were already stripped above; kept as a safeguard)
df.rename(columns={' market ': 'market', ' funding_total_usd ': 'funding_total_usd'}, inplace=True)

In [None]:
# Convert all date columns to pandas datetime; unparseable values become NaT
df['founded_at'] =  pd.to_datetime(df['founded_at'], format='%Y-%m-%d', errors = 'coerce')
df['first_funding_at'] =  pd.to_datetime(df['first_funding_at'], format='%Y-%m-%d', errors = 'coerce')
df['last_funding_at'] =  pd.to_datetime(df['last_funding_at'], format='%Y-%m-%d', errors = 'coerce')
df['founded_year'] =  pd.to_datetime(df['founded_year'], format='%Y', errors = 'coerce')
df['founded_month'] =  pd.to_datetime(df['founded_month'], format='%Y-%m', errors = 'coerce')

In [None]:
# Cleaning the Market column, since it has unnecessary spaces
df['market'] = df['market'].str.strip()

### Check and drop NaN

In [None]:
# Count NaN values
print(df[['name', 'market', 'funding_total_usd', 'country_code', 'founded_year', "status"]].isna().sum())

# Since status is our target variable, we drop all rows with NaN in this column
cleaned_df = df.dropna(subset=["status"])

# Check if it worked
print("Number of NaN: ", cleaned_df["status"].isna().sum())

In [None]:
# Additionally, we want to filter out all startups that have fewer than one funding round
# (the filter itself is not applied in this cell)

funding_cols = ['convertible_note','debt_financing', 'angel', 'grant', 'private_equity', 'post_ipo_equity',
       'post_ipo_debt', 'secondary_market', 'product_crowdfunding', 'round_A','round_B',
       'round_C', 'round_D', 'round_E', 'round_F', 'round_G','round_H']

# For the funding rounds, replace missing values with 0 and assign the result back
# (fillna returns a new object, so without the assignment df would stay unchanged)
df[funding_cols] = df[funding_cols].fillna(0)

funding_round = df[funding_cols]
funding_round

### Check duplicates again after cleaning empty rows

In [None]:
cleaned_df.duplicated().sum()
# After cleaning empty rows there are no duplicates

### Check for outliers in funding_total_usd

In [None]:
from scipy import stats

Q1 = df['funding_total_usd'].quantile(0.25)
Q3 = df['funding_total_usd'].quantile(0.75)
IQR = Q3 - Q1

# Define outliers as those beyond 1.5 times the IQR from the Q1 and Q3
outliers_IQR = df[(df['funding_total_usd'] < (Q1 - 1.5 * IQR)) | (df['funding_total_usd'] > (Q3 + 1.5 * IQR))]

# Calculate Z-scores to identify outliers; nan_policy='omit' ignores missing values,
# which would otherwise turn every Z-score into NaN
z_scores = np.abs(stats.zscore(df['funding_total_usd'], nan_policy='omit'))
outliers_Z = df[z_scores > 3]

# Display the number of outliers detected by each method
outliers_IQR_count = outliers_IQR.shape[0]
outliers_Z_count = outliers_Z.shape[0]

outliers_IQR_count, outliers_Z_count

# Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

### Distribution of target variable "Status"

In [None]:
cleaned_df.groupby("status").size()

In [None]:
# Create a count plot
sns.countplot(x='status', data=cleaned_df)

# Set the title and labels for the plot
plt.title('Count of Statuses')
plt.xlabel('Status')
plt.ylabel('Count')

# Display the plot
plt.show()

Explanation:
- The `operating` status clearly makes up the largest share of this dataset. This class imbalance may be a problem for modeling, so a deeper look into the operating status may be needed.
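
To make the concern about the dominant class concrete, the sketch below computes class shares and the accuracy of a trivial majority-class predictor, then checks that a stratified split preserves the shares. The label counts are made up for illustration and are not the counts of this dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels with the same three classes as the 'status' column
status = pd.Series(["operating"] * 80 + ["acquired"] * 12 + ["closed"] * 8, name="status")

# Relative class frequencies show how dominant the majority class is
shares = status.value_counts(normalize=True)
print(shares)

# Always predicting the majority class already reaches this accuracy,
# which is why accuracy alone can be misleading on imbalanced targets
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")

# A stratified split keeps the class shares the same in train and test
train, test = train_test_split(status, test_size=0.25, stratify=status, random_state=0)
print(test.value_counts(normalize=True))
```

If the real target is similarly skewed, metrics such as per-class recall or F1 are likely to be more informative than accuracy when justifying the tuning metric.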

In [None]:
# Deep dive into operating status
grouped = cleaned_df.groupby("status")["round_A"].count()

In [None]:
# First, group by 'status' and sum all the funding rounds
grouped = cleaned_df.groupby('status')[['convertible_note',
       'debt_financing', 'angel', 'grant', 'private_equity', 'post_ipo_equity',
       'post_ipo_debt', 'secondary_market', 'product_crowdfunding', 'round_A',
       'round_B', 'round_C', 'round_D', 'round_E', 'round_F', 'round_G',
       'round_H']].sum()

# Now plot a stacked bar chart
grouped.transpose().plot(kind='bar', stacked=True)

# Set the title and labels for the plot
plt.title('Composition of Operating Status by Funding Round')
plt.xlabel('Funding Round')
plt.ylabel('Amount')

# Display the plot
plt.show()

## Feature engineering
Creating good features is probably the most important step in the machine learning process.
This might involve doing:
- transformations
- aggregating over data points or over time and space, or finding differences (for example: differences between two monthly bills, time difference between two contacts with the client)
- creating dummy (binary) variables
- discretization

Business insight is very relevant in this process.
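
The four techniques listed above can be sketched on a few made-up rows. The column names mirror ones used in this notebook, but the values, the bucket boundaries, and the derived feature names are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up example rows; the column names mirror ones used in this notebook
toy = pd.DataFrame({
    "funding_total_usd": [50_000.0, 2_000_000.0, 750_000.0, 120_000_000.0],
    "first_funding_at": pd.to_datetime(["2010-01-15", "2011-06-01", "2012-03-10", "2008-09-30"]),
    "last_funding_at": pd.to_datetime(["2010-01-15", "2013-02-01", "2012-11-20", "2013-12-01"]),
    "round_A": [0.0, 1_500_000.0, 0.0, 20_000_000.0],
    "round_B": [0.0, 0.0, 0.0, 60_000_000.0],
})

# Transformation: log1p compresses the heavy right tail of funding amounts
toy["log_funding"] = np.log1p(toy["funding_total_usd"])

# Difference over time: days between the first and the last funding date
toy["funding_span_days"] = (toy["last_funding_at"] - toy["first_funding_at"]).dt.days

# Aggregation: number of venture rounds that actually raised money
toy["n_venture_rounds"] = (toy[["round_A", "round_B"]] > 0).sum(axis=1)

# Dummy (binary) variable: did the company raise a round A at all?
toy["has_round_A"] = (toy["round_A"] > 0).astype(int)

# Discretization: bucket funding into three labelled size classes
toy["funding_bucket"] = pd.cut(
    toy["funding_total_usd"],
    bins=[0, 100_000, 10_000_000, np.inf],
    labels=["small", "medium", "large"],
)

print(toy[["log_funding", "funding_span_days", "n_venture_rounds", "has_round_A", "funding_bucket"]])
```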


For the midterm submission, you should do all the pre-processing necessary for the model that you selected (for example encoding of categorical variables, scaling), and test at least one feature selection method (if the feature selection does not help, still show that you tried it out).

For the final submission you should try different approaches for feature scaling, feature selection, and so on.
If possible, you can also look for additional relevant data.
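
One possible way to combine the preprocessing and feature selection asked for above is a `ColumnTransformer` followed by `SelectKBest`, all inside a single pipeline. The sketch below is an illustration under assumptions: the data is randomly generated, `funding_rounds` and the target rule are hypothetical, and `f_classif` with `k=4` is just one of several selection methods that could be tested.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic dataset (made up for illustration)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "market": rng.choice(["Software", "Games", "Biotechnology"], size=n),
    "country_code": rng.choice(["USA", "GBR", "DEU"], size=n),
    "funding_total_usd": rng.lognormal(mean=13, sigma=2, size=n),
    "funding_rounds": rng.integers(1, 6, size=n),
})
# Hypothetical binary target that depends on the number of funding rounds
y = (X["funding_rounds"] + rng.normal(0, 1, size=n) > 3).astype(int)

# Encode categorical columns and scale numeric ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["market", "country_code"]),
    ("num", StandardScaler(), ["funding_total_usd", "funding_rounds"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=f_classif, k=4)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Compare cross-validated F1 with feature selection (k=4) and without it (k="all")
results = {}
for k in (4, "all"):
    model.set_params(select__k=k)
    results[k] = cross_val_score(model, X, y, cv=5, scoring="f1").mean()

print(results)
```

Reporting both numbers shows that feature selection was tried, even when it does not improve the score.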

In [None]:
#creating a copy of the dataset

df1 = df.copy()

In [None]:
df1.isnull().sum()

In [None]:
df1 = df1.dropna(subset=['permalink', 'status', 'name', 'market', 'country_code'])

In [None]:
df1.isnull().sum()

In [None]:
df1 = df1.drop(columns= ['homepage_url', 'category_list', 'state_code', 'founded_at', 'founded_month', 'founded_quarter', 'founded_year', 'city', 'region', 'first_funding_at', 'last_funding_at'])

### Industry Mapping
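
The mapping in this section assigns each market to the first industry group whose keyword list matches. As a compact alternative to deeply nested `where` calls, the same first-match logic can be sketched with `np.select`. This is not the code used in this notebook; the shortened keyword lists, the example markets, and the `"Other"` default are assumptions for illustration.

```python
import re

import numpy as np
import pandas as pd

# Shortened keyword lists, for illustration only
groups = {
    "Software": ["SaaS", "Enterprise Software", "Open Source"],
    "Gaming": ["Games", "Video Games", "Gambling"],
    "Health Care": ["Health Care", "Medical", "Electronic Health Record (EHR)"],
}

markets = pd.Series(["Enterprise Software", "Online Games", "Medical", "Curated Web"])

# One boolean mask per group; re.escape keeps characters such as parentheses
# from being interpreted as regular-expression syntax
conditions = [
    markets.str.contains(
        "|".join(re.escape(k) for k in keywords), case=False, regex=True, na=False
    ).to_numpy()
    for keywords in groups.values()
]

# np.select picks the label of the first matching condition, else the default
industry_group = pd.Series(
    np.select(conditions, list(groups.keys()), default="Other"),
    index=markets.index,
)

print(industry_group.tolist())
```

As with nested `where` calls, the order of the groups matters: a market that matches several keyword lists gets the label of the first group in the dictionary.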

In [None]:
import re

In [None]:
# Grouping markets into industries to decrease the number of segments. The list was taken from https://support.crunchbase.com/hc/en-us/articles/360043146954-What-Industries-are-included-in-Crunchbase-
admin_services = str('Employer Benefits Programs, Human Resource Automation, Corporate IT, Distribution, Service Providers, Archiving Service, Call Center, Collection Agency, College Recruiting, Courier Service, Debt Collections, Delivery, Document Preparation, Employee Benefits, Extermination Service, Facilities Support Services, Housekeeping Service, Human Resources, Knowledge Management, Office Administration, Packaging Services, Physical Security, Project Management, Staffing Agency, Trade Shows, Virtual Workforce').split(', ')
advertising = str('Creative Industries, Promotional, Advertising Ad Exchange, Ad Network, Ad Retargeting, Ad Server, Ad Targeting, Advertising, Advertising Platforms, Affiliate Marketing, Local Advertising, Mobile Advertising, Outdoor Advertising, SEM, Social Media Advertising, Video Advertising').split(', ')
agriculture = str('Agriculture, AgTech, Animal Feed, Aquaculture, Equestrian, Farming, Forestry, Horticulture, Hydroponics, Livestock').split(', ')
app = str('Application Performance Monitoring, App Stores, Application Platforms, Enterprise Application, App Discovery, Apps, Consumer Applications, Enterprise Applications, Mobile Apps, Reading Apps, Web Apps').split(', ')
artificial_intelli = str('Artificial Intelligence, Intelligent Systems, Machine Learning, Natural Language Processing, Predictive Analytics').split(', ')
biotechnology = str('Synthetic Biology, Bio-Pharm, Bioinformatics, Biometrics, Biopharma, Biotechnology, Genetics, Life Science, Neuroscience, Quantified Self').split(', ')
clothing = str('Fashion, Laundry and Dry-cleaning, Lingerie, Shoes').split(', ')
shopping = str('Consumer Behavior, Customer Support Tools, Discounts, Reviews and Recommendations, Auctions, Classifieds, Collectibles, Consumer Reviews, Coupons, E-Commerce, E-Commerce Platforms, Flash Sale, Gift, Gift Card, Gift Exchange, Gift Registry, Group Buying, Local Shopping, Made to Order, Marketplace, Online Auctions, Personalization, Point of Sale, Price Comparison, Rental, Retail, Retail Technology, Shopping, Shopping Mall, Social Shopping, Sporting Goods, Vending and Concessions, Virtual Goods, Wholesale').split(', ')
community = str("Self Development, Sex, Forums, Match-Making, Babies, Identity, Women, Kids, Entrepreneur, Networking, Adult, Baby, Cannabis, Children, Communities, Dating, Elderly, Family, Funerals, Humanitarian, Leisure, LGBT, Lifestyle, Men's, Online Forums, Parenting, Pet, Private Social Networking, Professional Networking, Q&A, Religion, Retirement, Sex Industry, Sex Tech, Social, Social Entrepreneurship, Teenagers, Virtual World, Wedding, Women's, Young Adults").split(', ')
electronics  = str('Mac, iPod Touch, Tablets, iPad, iPhone, Computer, Consumer Electronics, Drones, Electronics, Google Glass, Mobile Devices, Nintendo, Playstation, Roku, Smart Home, Wearables, Windows Phone, Xbox').split(', ')
consumer_goods= str('Commodities, Sunglasses, Groceries, Batteries, Cars, Beauty, Comics, Consumer Goods, Cosmetics, DIY, Drones, Eyewear, Fast-Moving Consumer Goods, Flowers, Furniture, Green Consumer Goods, Handmade, Jewelry, Lingerie, Shoes, Tobacco, Toys').split(', ')
content = str('E-Books, MicroBlogging, Opinions, Blogging Platforms, Content Delivery Network, Content Discovery, Content Syndication, Creative Agency, DRM, EBooks, Journalism, News, Photo Editing, Photo Sharing, Photography, Printing, Publishing, Social Bookmarking, Video Editing, Video Streaming').split(', ')
data = str('Optimization, A/B Testing, Analytics, Application Performance Management, Artificial Intelligence, Big Data, Bioinformatics, Biometrics, Business Intelligence, Consumer Research, Data Integration, Data Mining, Data Visualization, Database, Facial Recognition, Geospatial, Image Recognition, Intelligent Systems, Location Based Services, Machine Learning, Market Research, Natural Language Processing, Predictive Analytics, Product Research, Quantified Self, Speech Recognition, Test and Measurement, Text Analytics, Usability Testing').split(', ')
design = str('Visualization, Graphics, Design, Designers, CAD, Consumer Research, Data Visualization, Fashion, Graphic Design, Human Computer Interaction, Industrial Design, Interior Design, Market Research, Mechanical Design, Product Design, Product Research, Usability Testing, UX Design, Web Design').split(', ')
education = str('Universities, College Campuses, University Students, High Schools, All Students, Colleges, Alumni, Charter Schools, College Recruiting, Continuing Education, Corporate Training, E-Learning, EdTech, Education, Edutainment, Higher Education, Language Learning, MOOC, Music Education, Personal Development, Primary Education, Secondary Education, Skill Assessment, STEM Education, Textbook, Training, Tutoring, Vocational Education').split(', ')
energy = str('Gas, Natural Gas Uses, Oil, Oil & Gas, Battery, Biofuel, Biomass Energy, Clean Energy, Electrical Distribution, Energy, Energy Efficiency, Energy Management, Energy Storage, Fossil Fuels, Fuel, Fuel Cell, Oil and Gas, Power Grid, Renewable Energy, Solar, Wind Energy').split(', ')
events = str('Concerts, Event Management, Event Promotion, Events, Nightclubs, Nightlife, Reservations, Ticketing, Wedding').split(', ')
financial = str('Debt Collecting, P2P Money Transfer, Investment Management, Trading, Accounting, Angel Investment, Asset Management, Auto Insurance, Banking, Bitcoin, Commercial Insurance, Commercial Lending, Consumer Lending, Credit, Credit Bureau, Credit Cards, Crowdfunding, Cryptocurrency, Debit Cards, Debt Collections, Finance, Financial Exchanges, Financial Services, FinTech, Fraud Detection, Funding Platform, Gift Card, Health Insurance, Hedge Funds, Impact Investing, Incubators, Insurance, InsurTech, Leasing, Lending, Life Insurance, Micro Lending, Mobile Payments, Payments, Personal Finance, Prediction Markets, Property Insurance, Real Estate Investment, Stock Exchanges, Trading Platform, Transaction Processing, Venture Capital, Virtual Currency, Wealth Management').split(', ')
food = str('Specialty Foods, Bakery, Brewing, Cannabis, Catering, Coffee, Confectionery, Cooking, Craft Beer, Dietary Supplements, Distillery, Farmers Market, Food and Beverage, Food Delivery, Food Processing, Food Trucks, Fruit, Grocery, Nutrition, Organic Food, Recipes, Restaurants, Seafood, Snack Food, Tea, Tobacco, Wine And Spirits, Winery').split(', ')
gaming = str('Game, Games, Casual Games, Console Games, Contests, Fantasy Sports, Gambling, Gamification, Gaming, MMO Games, Online Games, PC Games, Serious Games, Video Games').split(', ')
government = str('Polling, Governance, CivicTech, Government, GovTech, Law Enforcement, Military, National Security, Politics, Public Safety, Social Assistance').split(', ')
hardware= str('Cable, 3D, 3D Technology, Application Specific Integrated Circuit (ASIC), Augmented Reality, Cloud Infrastructure, Communication Hardware, Communications Infrastructure, Computer, Computer Vision, Consumer Electronics, Data Center, Data Center Automation, Data Storage, Drone Management, Drones, DSP, Electronic Design Automation (EDA), Electronics, Embedded Systems, Field-Programmable Gate Array (FPGA), Flash Storage, Google Glass, GPS, GPU, Hardware, Industrial Design, Laser, Lighting, Mechanical Design, Mobile Devices, Network Hardware, NFC, Nintendo, Optical Communication, Playstation, Private Cloud, Retail Technology, RFID, RISC, Robotics, Roku, Satellite Communication, Semiconductor, Sensor, Sex Tech, Telecommunications, Video Conferencing, Virtual Reality, Virtualization, Wearables, Windows Phone, Wireless, Xbox').split(', ')
health_care = str('Senior Health, Physicians, Electronic Health Records, Doctors, Healthcare Services, Diagnostics, Alternative Medicine, Assisted Living, Assistive Technology, Biopharma, Cannabis, Child Care, Clinical Trials, Cosmetic Surgery, Dental, Diabetes, Dietary Supplements, Elder Care, Electronic Health Record (EHR), Emergency Medicine, Employee Benefits, Fertility, First Aid, Funerals, Genetics, Health Care, Health Diagnostics, Home Health Care, Hospital, Medical, Medical Device, mHealth, Nursing and Residential Care, Nutraceutical, Nutrition, Outpatient Care, Personal Health, Pharmaceutical, Psychology, Rehabilitation, Therapeutics, Veterinary, Wellness').split(', ')
it = str('Distributors, Algorithms, ICT, M2M, Technology, Business Information Systems, CivicTech, Cloud Data Services, Cloud Management, Cloud Security, CMS, Contact Management, CRM, Cyber Security, Data Center, Data Center Automation, Data Integration, Data Mining, Data Visualization, Document Management, E-Signature, Email, GovTech, Identity Management, Information and Communications Technology (ICT), Information Services, Information Technology, Intrusion Detection, IT Infrastructure, IT Management, Management Information Systems, Messaging, Military, Network Security, Penetration Testing, Private Cloud, Reputation, Sales Automation, Scheduling, Social CRM, Spam Filtering, Technical Support, Unified Communications, Video Chat, Video Conferencing, Virtualization, VoIP').split(', ')
internet = str('Online Identity, Cyber, Portals, Web Presence Management, Domains, Tracking, Web Tools, Curated Web, Search, Cloud Computing, Cloud Data Services, Cloud Infrastructure, Cloud Management, Cloud Storage, Darknet, Domain Registrar, E-Commerce Platforms, Ediscovery, Email, Internet, Internet of Things, ISP, Location Based Services, Messaging, Music Streaming, Online Forums, Online Portals, Private Cloud, Product Search, Search Engine, SEM, Semantic Search, Semantic Web, SEO, SMS, Social Media, Social Media Management, Social Network, Unified Communications, Vertical Search, Video Chat, Video Conferencing, Visual Search, VoIP, Web Browsers, Web Hosting').split(', ')
invest = str('Angel Investment, Banking, Commercial Lending, Consumer Lending, Credit, Credit Cards, Financial Exchanges, Funding Platform, Hedge Funds, Impact Investing, Incubators, Micro Lending, Stock Exchanges, Trading Platform, Venture Capital').split(', ')
manufacturing = str('Innovation Engineering, Civil Engineers, Heavy Industry, Engineering Firms, Systems, 3D Printing, Advanced Materials, Foundries, Industrial, Industrial Automation, Industrial Engineering, Industrial Manufacturing, Machinery Manufacturing, Manufacturing, Paper Manufacturing, Plastics and Rubber Manufacturing, Textiles, Wood Processing').split(', ')
media = str('Writers, Creative, Television, Entertainment, Media, Advice, Animation, Art, Audio, Audiobooks, Blogging Platforms, Broadcasting, Celebrity, Concerts, Content, Content Creators, Content Discovery, Content Syndication, Creative Agency, Digital Entertainment, Digital Media, DRM, EBooks, Edutainment, Event Management, Event Promotion, Events, Film, Film Distribution, Film Production, Guides, In-Flight Entertainment, Independent Music, Internet Radio, Journalism, Media and Entertainment, Motion Capture, Music, Music Education, Music Label, Music Streaming, Music Venues, Musical Instruments, News, Nightclubs, Nightlife, Performing Arts, Photo Editing, Photo Sharing, Photography, Podcast, Printing, Publishing, Reservations, Social Media, Social News, Theatre, Ticketing, TV, TV Production, Video, Video Editing, Video on Demand, Video Streaming, Virtual World').split(', ')
message = str('Unifed Communications, Chat, Email, Meeting Software, Messaging, SMS, Unified Communications, Video Chat, Video Conferencing, VoIP, Wired Telecommunications').split(', ')
mobile = str('Android, Google Glass, iOS, mHealth, Mobile, Mobile Apps, Mobile Devices, Mobile Payments, Windows Phone, Wireless').split(', ')
music = str('Audio, Audiobooks, Independent Music, Internet Radio, Music, Music Education, Music Label, Music Streaming, Musical Instruments, Podcast').split(', ')
resource = str('Biofuel, Biomass Energy, Fossil Fuels, Mineral, Mining, Mining Technology, Natural Resources, Oil and Gas, Precious Metals, Solar, Timber, Water, Wind Energy').split(', ')
navigation = str('Maps, Geospatial, GPS, Indoor Positioning, Location Based Services, Mapping Services, Navigation').split(', ')
other = str('Mass Customization, Monetization, Testing, Subscription Businesses, Mobility, Incentives, Peer-to-Peer, Nonprofits, Alumni, Association, B2B, B2C, Blockchain, Charity, Collaboration, Collaborative Consumption, Commercial, Consumer, Crowdsourcing, Customer Service, Desktop Apps, Emerging Markets, Enterprise, Ethereum, Franchise, Freemium, Generation Y, Generation Z, Homeless Shelter, Infrastructure, Knowledge Management, LGBT Millennials, Non Profit, Peer to Peer, Professional Services, Project Management, Real Time, Retirement, Service Industry, Sharing Economy, Small and Medium Businesses, Social Bookmarking, Social Impact, Subscription Service, Technical Support, Underserved Children, Universities').split(', ')
payment = str('Billing, Bitcoin, Credit Cards, Cryptocurrency, Debit Cards, Fraud Detection, Mobile Payments, Payments, Transaction Processing, Virtual Currency').split(', ')
platforms = str('Development Platforms, Android, Facebook, Google, Google Glass, iOS, Linux, macOS, Nintendo, Operating Systems, Playstation, Roku, Tizen, Twitter, WebOS, Windows, Windows Phone, Xbox').split(', ')
privacy = str('Digital Rights Management, Personal Data, Cloud Security, Corrections Facilities, Cyber Security, DRM, E-Signature, Fraud Detection, Homeland Security, Identity Management, Intrusion Detection, Law Enforcement, Network Security, Penetration Testing, Physical Security, Privacy, Security').split(', ')
services = str('Funeral Industry, English-Speaking, Spas, Plumbers, Service Industries, Staffing Firms, Translation, Career Management, Business Services, Services, Accounting, Business Development, Career Planning, Compliance, Consulting, Customer Service, Employment, Environmental Consulting, Field Support, Freelance, Intellectual Property, Innovation Management, Legal, Legal Tech, Management Consulting, Outsourcing, Professional Networking, Quality Assurance, Recruiting, Risk Management, Social Recruiting, Translation Service').split(', ')
realestate= str('Office Space, Self Storage, Brokers, Storage, Home Owners, Self Storage , Realtors, Home & Garden, Utilities, Home Automation, Architecture, Building Maintenance, Building Material, Commercial Real Estate, Construction, Coworking, Facility Management, Fast-Moving Consumer Goods, Green Building, Home and Garden, Home Decor, Home Improvement, Home Renovation, Home Services, Interior Design, Janitorial Service, Landscaping, Property Development, Property Management, Real Estate, Real Estate Investment, Rental Property, Residential, Self-Storage, Smart Building, Smart Cities, Smart Home, Timeshare, Vacation Rental').split(', ')
sales = str('Advertising, Affiliate Marketing, App Discovery, App Marketing, Brand Marketing, Cause Marketing, Content Marketing, CRM, Digital Marketing, Digital Signage, Direct Marketing, Direct Sales, Email Marketing, Lead Generation, Lead Management, Local, Local Advertising, Local Business, Loyalty Programs, Marketing, Marketing Automation, Mobile Advertising, Multi-level Marketing, Outdoor Advertising, Personal Branding, Public Relations, Sales, Sales Automation, SEM, SEO, Social CRM, Social Media Advertising, Social Media Management, Social Media Marketing, Sponsorship, Video Advertising').split(', ')
science = str('Face Recognition, New Technologies, Advanced Materials, Aerospace, Artificial Intelligence, Bioinformatics, Biometrics, Biopharma, Biotechnology, Chemical, Chemical Engineering, Civil Engineering, Embedded Systems, Environmental Engineering, Human Computer Interaction, Industrial Automation, Industrial Engineering, Intelligent Systems, Laser, Life Science, Marine Technology, Mechanical Engineering, Nanotechnology, Neuroscience, Nuclear, Quantum Computing, Robotics, Semiconductor, Software Engineering, STEM Education').split(', ')
software = str('Business Productivity, 3D Technology, Android, App Discovery, Application Performance Management, Apps, Artificial Intelligence, Augmented Reality, Billing, Bitcoin, Browser Extensions, CAD, Cloud Computing, Cloud Management, CMS, Computer Vision, Consumer Applications, Consumer Software, Contact Management, CRM, Cryptocurrency, Data Center Automation, Data Integration, Data Storage, Data Visualization, Database, Developer APIs, Developer Platform, Developer Tools, Document Management, Drone Management, E-Learning, EdTech, Electronic Design Automation (EDA), Embedded Software, Embedded Systems, Enterprise Applications, Enterprise Resource Planning (ERP), Enterprise Software, Facial Recognition, File Sharing, IaaS, Image Recognition, iOS, Linux, Machine Learning, macOS, Marketing Automation, Meeting Software, Mobile Apps, Mobile Payments, MOOC, Natural Language Processing, Open Source, Operating Systems, PaaS, Predictive Analytics, Presentation Software, Presentations, Private Cloud, Productivity Tools, QR Codes, Reading Apps, Retail Technology, Robotics, SaaS, Sales Automation, Scheduling, Sex Tech, Simulation, SNS, Social CRM, Software, Software Engineering, Speech Recognition, Task Management, Text Analytics, Transaction Processing, Video Conferencing, Virtual Assistant, Virtual Currency, Virtual Desktop, Virtual Goods, Virtual Reality, Virtual World, Virtualization, Web Apps, Web Browsers, Web Development').split(', ')
sports = str('American Football, Baseball, Basketball, Boating, Cricket, Cycling, Diving, eSports, Fantasy Sports, Fitness, Golf, Hockey, Hunting, Outdoors, Racing, Recreation, Rugby, Sailing, Skiing, Soccer, Sporting Goods, Sports, Surfing, Swimming, Table Tennis, Tennis, Ultimate Frisbee, Volley Ball').split(', ')
sustainability = str('Green, Wind, Biomass Power Generation, Renewable Tech, Environmental Innovation, Renewable Energies, Clean Technology, Biofuel, Biomass Energy, Clean Energy, CleanTech, Energy Efficiency, Environmental Engineering, Green Building, Green Consumer Goods, GreenTech, Natural Resources, Organic, Pollution Control, Recycling, Renewable Energy, Solar, Sustainability, Waste Management, Water Purification, Wind Energy').split(', ')
transportation = str('Taxis, Air Transportation, Automotive, Autonomous Vehicles, Car Sharing, Courier Service, Delivery Service, Electric Vehicle, Ferry Service, Fleet Management, Food Delivery, Freight Service, Last Mile Transportation, Limousine Service, Logistics, Marine Transportation, Parking, Ports and Harbors, Procurement, Public Transportation, Railroad, Recreational Vehicles, Ride Sharing, Same Day Delivery, Shipping, Shipping Broker, Space Travel, Supply Chain Management, Taxi Service, Transportation, Warehousing, Water Transportation').split(', ')
travel = str('Adventure Travel, Amusement Park and Arcade, Business Travel, Casino, Hospitality, Hotel, Museums and Historical Sites, Parks, Resorts, Timeshare, Tour Operator, Tourism, Travel, Travel Accommodations, Travel Agency, Vacation Rental').split(', ')
video = str('Animation, Broadcasting, Film, Film Distribution, Film Production, Motion Capture, TV, TV Production, Video, Video Editing, Video on Demand, Video Streaming').split(', ')

In [None]:
# Build the Industry_Group column: the first keyword list that matches wins, "Other" is the fallback.
# `pd.np` was removed in pandas 2.0, so NumPy is imported directly and np.select replaces the nested where calls.
import re
import numpy as np

# (keyword list, label) pairs, kept in the original priority order.
# `other` needs no entry because its label equals the default.
industry_keywords = [
    (admin_services, "Administrative Services"),
    (software, "Software"),
    (advertising, "Advertising"),
    (agriculture, "Agriculture and Farming"),
    (app, "Apps"),
    (artificial_intelli, "Artificial Intelligence"),
    (biotechnology, "Biotechnology"),
    (clothing, "Clothing and Apparel"),
    (shopping, "Commerce and Shopping"),
    (community, "Community and Lifestyle"),
    (electronics, "Consumer Electronics"),
    (consumer_goods, "Consumer Goods"),
    (content, "Content and Publishing"),
    (data, "Data and Analytics"),
    (design, "Design"),
    (education, "Education"),
    (energy, "Energy"),
    (events, "Events"),
    (financial, "Financial Services"),
    (food, "Food and Beverage"),
    (gaming, "Gaming"),
    (government, "Government and Military"),
    (hardware, "Hardware"),
    (health_care, "Health Care"),
    (it, "Information Technology"),
    (internet, "Internet Services"),
    (invest, "Lending and Investments"),
    (manufacturing, "Manufacturing"),
    (media, "Media and Entertainment"),
    (message, "Messaging and Telecommunication"),
    (mobile, "Mobile"),
    (music, "Music and Audio"),
    (resource, "Natural Resources"),
    (navigation, "Navigation and Mapping"),
    (payment, "Payments"),
    (platforms, "Platforms"),
    (privacy, "Privacy and Security"),
    (services, "Professional Services"),
    (realestate, "Real Estate"),
    (sales, "Sales and Marketing"),
    (science, "Science and Engineering"),
    (sports, "Sports"),
    (sustainability, "Sustainability"),
    (transportation, "Transportation"),
    (travel, "Travel and Tourism"),
    (video, "Video"),
]

# na=False treats a missing market as "no match", so such rows fall through to "Other".
conditions = [
    df1.market.str.contains('|'.join(keywords), flags=re.IGNORECASE, na=False)
    for keywords, _ in industry_keywords
]
labels = [label for _, label in industry_keywords]
df1['Industry_Group'] = np.select(conditions, labels, default="Other")

In [None]:
pd.DataFrame(df1['Industry_Group'].unique())

In [None]:
print("Number of unique industries:", df1['Industry_Group'].nunique())

In [None]:
df1.groupby('Industry_Group').size().sort_values(ascending = False).reset_index()

### Country Mapping

In [None]:
country = pd.read_csv('continents2.csv')
country.rename(columns = {"region": "continent"}, inplace = True)
country = country[['continent', 'alpha-3']]

In [None]:
df1 = df1.merge(country, left_on='country_code', right_on='alpha-3')

In [None]:
df1.groupby('continent').size()

### Discretization and Binning

In [None]:
df2 = df1.copy()

In [None]:
df2 = df2.drop(['alpha-3', 'country_code', 'market'], axis=1)

In [None]:
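# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions.
df2["funding_total_usd"] = df2["funding_total_usd"].fillna(0)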

In [None]:
df2[['funding_rounds', 'seed', 'venture', 'equity_crowdfunding',
       'undisclosed', 'convertible_note', 'debt_financing', 'angel', 'grant',
       'private_equity', 'post_ipo_equity', 'post_ipo_debt',
       'secondary_market', 'product_crowdfunding', 'round_A', 'round_B',
       'round_C', 'round_D', 'round_E', 'round_F', 'round_G', 'round_H',
      'funding_total_usd']].describe().T

In [None]:
group_labels_1 = ["low", "medium", "medium-high", "high"]

df2["category_total"] = pd.qcut(df2["funding_total_usd"], 4, labels = group_labels_1)

In [None]:
df2.groupby("category_total").size()

In [None]:
group_labels_2 = ["low", "high"]

df2["category_funding_rounds"] = pd.cut(df2["funding_rounds"], bins = [-1, 2, 18], labels = group_labels_2)

In [None]:
df2.groupby("category_funding_rounds").size()

In [None]:
group_labels_3 = ["low", "high"]

df2["category_seed"] = pd.cut(df2["seed"], bins = [-1, 28000, 130000000], labels = group_labels_3)

In [None]:
df2.groupby("category_seed").size()

In [None]:
group_labels_4 = ["low", "medium", "high"]

df2["category_venture"] = pd.cut(df2["venture"], bins = [-1, 85038.5, 6000000, 2351000000], labels = group_labels_4)

In [None]:
df2.groupby("category_venture").size()

In [None]:
df2['category_status'] = df2['status'].replace(['closed', 'operating', 'acquired'], [0, 1, 2])
df2['category_total'] = df2['category_total'].replace(['low','medium','medium-high','high'], [0, 1, 2, 3])
df2['category_funding_rounds'] = df2['category_funding_rounds'].replace(['low', 'high'], [0, 1])
df2['category_seed'] = df2['category_seed'].replace(['low', 'high'], [0, 1])
# 'high' is encoded as 2 so that the ordinal codes stay consecutive (0, 1, 2).
df2['category_venture'] = df2['category_venture'].replace(['low','medium','high'], [0, 1, 2])

In [None]:
def categorize_funding(value):
    """Return 0 for an amount below 1 and 1 for an amount above 1."""
    if value < 1:
        return 0
    elif value > 1:
        return 1
    else:
        # Reached only for an amount of exactly 1 or a missing value.
        return None

# Funding columns that all share the same binary rule.
funding_columns = [
    "equity_crowdfunding", "undisclosed", "convertible_note", "debt_financing",
    "angel", "grant", "private_equity", "post_ipo_equity", "post_ipo_debt",
    "secondary_market", "product_crowdfunding",
    "round_A", "round_B", "round_C", "round_D",
    "round_E", "round_F", "round_G", "round_H",
]

# Create one category_<column> flag per funding column.
for column in funding_columns:
    df2["category_" + column] = df2[column].apply(categorize_funding)

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()


df2['category_continent'] = labelencoder.fit_transform(df2['continent']) # using label encoder on continent
df2['category_Industry_Group'] = labelencoder.fit_transform(df2['Industry_Group'])

In [None]:
df3 = df2[['category_status', 'category_Industry_Group',
       'category_continent', 'category_funding_rounds',
       'category_total',
       'category_equity_crowdfunding', 'category_venture', 'category_seed', 'category_undisclosed',
       'category_convertible_note', 'category_debt_financing', 'category_angel', 'category_grant',
       'category_private_equity', 'category_post_ipo_equity', 'category_post_ipo_debt',
       'category_secondary_market', 'category_product_crowdfunding', 'category_round_A',
       'category_round_B', 'category_round_C', 'category_round_D', 'category_round_E',
       'category_round_F', 'category_round_G', 'category_round_H']]

df3.head()




## Modeling
You should implement AT LEAST TWO approaches we covered so far, and tune at least two hyperparameters of each approach.
Do not forget to split your data into train and test sets.
You should do model selection and tuning using cross validation on the train set, avoiding data leakage.
Explain and justify what is the metric you are using for model selection and tuning.
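
The workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (`make_classification` stands in for your prepared features and target), and the choice of logistic regression, the parameter grid, and macro-averaged F1 as the tuning metric are placeholders that you should replace and justify for your own project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-class data standing in for the prepared features and target (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.1, 0.7, 0.2], random_state=0,
)

# Hold out the test set first; it is not used again until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing lives inside the pipeline, so it is re-fitted on the training part of each CV fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Two hyperparameters of this approach are tuned (illustrative values).
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1_macro",  # placeholder metric; justify your own choice
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because scaling is a pipeline step rather than something applied to the full dataset beforehand, no statistics from validation folds or the test set leak into training. A second approach would get its own pipeline and grid in the same way.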

## Model evaluation

After selecting and tuning your model on the train set, you should evaluate its performance on the test set.
You might have tuned your model using a certain metric, but now you should describe the model performance using all relevant metrics.
If you have business insight into why a certain metric is relevant, you should explain it.
For example, in disease detection, you might not want to have a false positive rate higher than some threshold (say 5%).
Choose a suitable baseline to benchmark your results and put them in context.
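
A minimal sketch of such an evaluation, again on synthetic data, is shown here. The random forest is only a stand-in for your tuned model, and the majority-class `DummyClassifier` is one possible baseline among several; both are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data standing in for the prepared features and target (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.1, 0.7, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline that always predicts the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Stand-in for the model selected and tuned by cross validation on the train set.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Compare both on the untouched test set with a metric that is robust to class imbalance.
for name, estimator in [("baseline", baseline), ("model", model)]:
    y_pred = estimator.predict(X_test)
    print(name, "balanced accuracy:", round(balanced_accuracy_score(y_test, y_pred), 3))

# Report all relevant metrics for the model, not only the one used for tuning.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```

On imbalanced classes the majority-class baseline can reach a high plain accuracy, which is why per-class precision and recall and the confusion matrix are reported alongside a single summary number.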

If you performed clustering, you should describe and visualize your clusters.

## Next steps

If you already have an idea of the next steps for your project, please describe them.
This part is not graded, but can help us give you feedback.
