<h1 align = "center" > Classification of Startup Viability </h1>

**Project Introduction**

The objective of this project is to develop a classification model that identifies entities registered with StartupBlink as either "Viable" or "Not Viable." In this context, "Viable" refers to those entities that meet the data cleaning criteria established by StartupBlink and are thus retained in the system, while "Not Viable" signifies those that do not meet these standards and are consequently removed.

To achieve this classification, I utilized two datasets: test_task.csv and cleaned_test_task.csv.

- test_task.csv is the original dataset provided by StartupBlink to assess my skills in data cleaning and processing.
- cleaned_test_task.csv is the dataset produced by applying the data cleaning criteria specified by StartupBlink to the original dataset.

The classification of entities into "Viable" and "Not Viable" is determined by a binary target variable called classified. This column was created in the original dataset by checking each entity's UUID against the cleaned dataset: entities whose UUID appears in the cleaned dataset are labeled 1 (viable), and those whose UUID does not appear are labeled 0 (not viable).

The independent variables used to construct the classification models are description, categories_list, city, region, country, continent, and tld. Together they give the models textual, geographic, and domain-level context for each entity.

Five different models were developed to identify the most effective approach for classification:
- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- Neural Network

In [24]:
# Importing libraries
import ast
from string import punctuation, digits

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# downloading the stopword corpus if it is not already available locally
nltk.download("stopwords", quiet=True)

# storing the stopwords in a set so that membership checks are fast
stopword = set(stopwords.words("english"))

In [2]:
# reading the csv files into DataFrames
df = pd.read_csv("test_task.csv")
df_cleaned = pd.read_csv("cleaned_test_task.csv")

In [3]:
df.head()

| | Unnamed: 0 | uuid | website | Name | description | valuation_usd | categories | location |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5639c94c-408b-450a-9694-df238f6d9da9 | https://www.perplexity.ai | Perplexity AI | Perplexity is a search engine platform that us... | 520000000.0 | [{'entity_def_id': 'category', 'permalink': 'a... | [{'permalink': 'san-francisco-california', 'uu... |
| 1 | 1 | 00e32424-3cc3-4798-8361-78914c212fb0 | https://mistral.ai | Mistral AI | Mistral AI is a platform that assembles team t... | 1999462000.0 | [{'entity_def_id': 'category', 'permalink': 'a... | NaN |
| 2 | 2 | c8326e69-b364-44af-a7a7-1f9239d532f1 | https://robinai.co.uk | Robin AI | Robin AI is a legal infrastructure business th... | NaN | [{'entity_def_id': 'category', 'permalink': 'a... | [{'permalink': 'london-england', 'uuid': 'aad1... |
| 3 | 3 | cf2c678c-b81a-80c3-10d1-9c5e76448e51 | https://www.openai.com | OpenAI | OpenAI is an AI research and deployment compan... | 29000000000.0 | [{'entity_def_id': 'category', 'permalink': 'a... | [{'permalink': 'san-francisco-california', 'uu... |
| 4 | 4 | 0f3fd8ce-9fe3-d6bd-d243-ffdf2fcfb012 | http://www.startengine.com | StartEngine | StartEngine is an equity crowdfunding platform... | 120000000.0 | [{'entity_def_id': 'category', 'permalink': 'c... | [{'permalink': 'los-angeles-california', 'uuid... |


**Data Cleaning and Feature Generation**

In [4]:
# removing the unnamed column as it holds no significance
df = df.drop(columns=['Unnamed: 0'])

# removing the valuation column because there are too many missing values
df = df.drop(columns=["valuation_usd"])

# removing rows with missing values in the following columns below
df = df.dropna(subset=['website', 'description', 'categories', 'location'])

# converting the string representation of lists to actual lists of dictionaries
df['categories'] = df['categories'].apply(ast.literal_eval)
# creating a new column (categories_list) by extracting and joining the "value" key into a single string
df['categories_list'] = df['categories'].apply(
    lambda x: ', '.join([category['value'] for category in x]) if isinstance(x, list) else ''
)
# Ensure the new column is of object dtype (string)
df['categories_list'] = df['categories_list'].astype('object')
# dropping the categories column as it is no longer needed
df = df.drop(columns=['categories'])


# defining a function to extract the value based on location_type
def extract_location(locations, loc_type):
    # Extract the first location value if available, else return None
    values = [loc['value'] for loc in locations if loc['location_type'] == loc_type]
    return values[0] if values else None

# converting location column into actual lists of dictionaries
df['location'] = df['location'].apply(ast.literal_eval)
# creating separate columns for city, region, and country
df['city'] = df['location'].apply(lambda x: extract_location(x, 'city'))
df['region'] = df['location'].apply(lambda x: extract_location(x, 'region'))
df['country'] = df['location'].apply(lambda x: extract_location(x, 'country'))
df['continent'] = df['location'].apply(lambda x: extract_location(x, 'continent'))
# dropping the location column as it is no longer needed
df = df.drop(columns=['location'])

# creating a new column (tld) by extracting the top-level domain from the website column
df['tld'] = df['website'].str.extract(r'(?:https?://)?(?:www\.)?[^/]+\.(\w+)', expand=False) # using regex
# dropping the website column
df = df.drop(columns=['website'])

# creating a set of unique values from df_cleaned
df_cleaned_set = set(df_cleaned['uuid'])

# creating the 'classified' column in df based on the presence in df_cleaned
df['classified'] = df['uuid'].apply(lambda x: 1 if x in df_cleaned_set else 0)
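
The regular expression used for tld above takes the last dot-separated label before the first slash, so a URL with a query string but no path could yield text from the query rather than the host. As a minimal sketch of an alternative, assuming the standard library's urllib.parse is acceptable here, the helper below (extract_tld is a hypothetical name, not part of the original notebook) reads the label from the parsed host instead:

```python
from urllib.parse import urlparse


def extract_tld(url):
    """Return the last dot-separated label of the URL's host, or None."""
    # urlparse only fills in the host when the URL has a scheme or starts
    # with "//", so add "//" for bare hosts such as "example.com/path".
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname  # lowercased, without port or credentials
    if not host or "." not in host:
        return None
    return host.rsplit(".", 1)[1]


print(extract_tld("https://www.perplexity.ai"))  # ai
print(extract_tld("https://robinai.co.uk"))      # uk
print(extract_tld("example.com?ref=a.b"))        # com
```

If adopted, it could be applied with df['website'].apply(extract_tld) in place of the str.extract call.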

In [5]:
df.head()

| | uuid | Name | description | categories_list | city | region | country | continent | tld | classified |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5639c94c-408b-450a-9694-df238f6d9da9 | Perplexity AI | Perplexity is a search engine platform that us... | Artificial Intelligence (AI), Chatbot, Generat... | San Francisco | California | United States | North America | ai | 1 |
| 2 | c8326e69-b364-44af-a7a7-1f9239d532f1 | Robin AI | Robin AI is a legal infrastructure business th... | Artificial Intelligence (AI), Contact Manageme... | London | England | United Kingdom | Europe | uk | 0 |
| 3 | cf2c678c-b81a-80c3-10d1-9c5e76448e51 | OpenAI | OpenAI is an AI research and deployment compan... | Artificial Intelligence (AI), Generative AI, M... | San Francisco | California | United States | North America | com | 0 |
| 4 | 0f3fd8ce-9fe3-d6bd-d243-ffdf2fcfb012 | StartEngine | StartEngine is an equity crowdfunding platform... | Crowdfunding, Finance, Financial Services, Fin... | Los Angeles | California | United States | North America | com | 0 |
| 5 | 99cc593e-96ee-48fe-824a-0ebc12a94a63 | bitsCrunch | AI-enhanced decentralized data networks delive... | Analytics, Artificial Intelligence (AI), Block... | Munich | Bayern | Germany | Europe | com | 1 |


In [6]:
df["classified"].value_counts()

classified
0    6476
1    3321
Name: count, dtype: int64
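
The classes are imbalanced, with roughly two "Not Viable" entities for every "Viable" one. A useful reference point for the accuracy figures reported later is the score of a classifier that always predicts the majority class. The sketch below computes it from the counts above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 6476, 1: 3321}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")            # 0
print(f"Baseline accuracy: {baseline_accuracy:.4f}")  # 0.6610
```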

**Preparation of Independent variables for model building**

In [8]:
# combining relevant text columns into a single "document" per row
text_columns = ['description', 'categories_list', 'city', 'region', 'country', 'continent', 'tld']
df['combined_text'] = df[text_columns].fillna('').agg(' '.join, axis=1)

In [11]:
df["combined_text"].iloc[0]

"Perplexity is a search engine platform that uses artificial intelligence to provide large language models and search engines. The company's platform enables the development of beneficial artificial general intelligence, as well as an open-source environment that is accessible to the public, allowing clients to develop skills and knowledge in artificial intelligence. Artificial Intelligence (AI), Chatbot, Generative AI, Search Engine, Software San Francisco California United States North America ai"

***Building a Bag-of-Words (BoW) model***

Three functions have been defined to process the text data for machine learning. 

1. extract_words(text)

This function tokenizes input text by isolating punctuation and digits, converting the text to lowercase, and splitting it into individual words. The processed words can then be used in further text analysis. It is a helper function for the bag_of_words function.

2. bag_of_words(texts, remove_stopword=False)

This function builds a vocabulary dictionary from a list of text documents. Each unique word gets assigned a unique index, which allows for representing each word in a feature vector (used for bag-of-words). If remove_stopword is True, it excludes words found in a stopword list.

3. extract_bow_feature_vectors(reviews, indices_by_word, binarize=True)

This function generates a feature matrix representing the Bag-of-Words model for a list of text documents. Each row corresponds to a document, and each column represents a word from the vocabulary dictionary. The function also allows for binarization, which indicates word presence or absence (useful for binary classification).

In [12]:
def extract_words(text):
    """
    Tokenizes the input text by isolating punctuation and digits, converting all text to lowercase,
    and splitting the text into individual words.

    Parameters:
    text (str): The input string to process.

    Returns:
    list: A list of lowercase words with punctuation and digits separated.
    """
    for c in punctuation + digits:
        text = text.replace(c, ' ' + c + ' ')
    return text.lower().split()

def bag_of_words(texts, remove_stopword=False):
    """
    Builds a vocabulary dictionary from a list of texts. Each unique word is assigned an index,
    allowing it to be represented in feature vectors. Optionally excludes stopwords.

    Parameters:
    texts (list of str): A list of text documents to process.
    remove_stopword (bool): If True, excludes words found in the 'stopword' list.

    Returns:
    dict: A dictionary mapping each unique word to a unique index, forming the vocabulary.
    """
    indices_by_word = {}
    for text in texts:
        word_list = extract_words(text)
        for word in word_list:
            if remove_stopword and word in stopword:  # Remove if in stopwords
                continue
            if word not in indices_by_word:
                indices_by_word[word] = len(indices_by_word)
    return indices_by_word

def extract_bow_feature_vectors(reviews, indices_by_word, binarize=True):
    """
    Creates a feature matrix for a list of reviews based on the provided vocabulary.
    Each row represents a document, and each column corresponds to a word in the vocabulary.
    Optionally binarizes the matrix to indicate word presence/absence.

    Parameters:
    reviews (list of str): A list of text documents to process.
    indices_by_word (dict): A dictionary mapping words to unique indices (vocabulary).
    binarize (bool): If True, converts non-zero entries in the feature matrix to 1.

    Returns:
    np.ndarray: A feature matrix with rows representing documents and columns representing words.
    """
    feature_matrix = np.zeros([len(reviews), len(indices_by_word)], dtype=np.float64)
    for i, text in enumerate(reviews):
        word_list = extract_words(text)
        for word in word_list:
            if word not in indices_by_word: 
                continue
            feature_matrix[i, indices_by_word[word]] += 1
    if binarize:
        feature_matrix[feature_matrix > 0] = 1
    return feature_matrix

**Defining Dependent (y) and Independent (X_bow) Variables**

In [13]:
# generating bag-of-words indices and feature matrix
combined_texts = df['combined_text'].tolist()
indices_by_word = bag_of_words(combined_texts)
X_bow = extract_bow_feature_vectors(combined_texts, indices_by_word)

# preparing target labels (shared by all models below)
y = df['classified'].values

**Model Development and Evaluation (Logistic Regression)**

In [17]:
# splitting the feature matrix and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# calculating performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# displaying the results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.6770
Precision: 0.5396
Recall: 0.4897
F1 Score: 0.5135
Confusion Matrix:
[[993 285]
 [348 334]]
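
The same five metrics are computed and printed in every model cell of this notebook. One way to avoid repeating that code is a small helper; the sketch below is an assumption about how it could look (evaluate is a hypothetical name, and the labels passed to it are toy values for illustration only):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def evaluate(y_true, y_pred):
    """Print and return the metrics used throughout this notebook."""
    results = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
    }
    for name, value in results.items():
        print(f"{name}: {value:.4f}")
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    return results


# Toy labels, for illustration only.
scores = evaluate([0, 0, 1, 1], [0, 1, 1, 1])
```

With such a helper, each model cell would reduce to fitting the model and calling evaluate(y_test, y_pred).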


**Model Development and Evaluation (Decision Trees)**

In [19]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)

# Initialize and train the Decision Tree classifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.6276
Precision: 0.4613
Recall: 0.4194
F1 Score: 0.4393
Confusion Matrix:
[[944 334]
 [396 286]]


**Model Development and Evaluation (Random Forest)**

In [23]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest classifier
model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Confusion Matrix:")
print(conf_matrix)


Accuracy: 0.6852
Precision: 0.5895
Recall: 0.3138
F1 Score: 0.4096
Confusion Matrix:
[[1129  149]
 [ 468  214]]
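
The class_weight='balanced' option weights each class inversely to its frequency in the training labels, using n_samples / (n_classes * class_count). The sketch below reproduces that arithmetic by hand. The training counts are not printed anywhere in the notebook; they are inferred here by subtracting the test-set rows of the confusion matrix from the overall class counts, which is an assumption about the split:

```python
# Training-set class counts implied by the outputs above:
# 9797 rows in total, 1960 in the test set (1278 negatives, 682 positives).
n_train = {0: 6476 - 1278, 1: 3321 - 682}

n_samples = sum(n_train.values())
n_classes = len(n_train)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {c: n_samples / (n_classes * n) for c, n in n_train.items()}

for c, w in weights.items():
    print(f"class {c}: weight {w:.4f}")
```

The minority "Viable" class receives roughly twice the weight of the majority class, so errors on it count for more during training.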


**Model Development and Evaluation (Support Vector Machines)**

In [25]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)

# Initialize and train the SVM classifier with a linear kernel
model = SVC(kernel='linear', class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.6526
Precision: 0.5007
Recall: 0.5191
F1 Score: 0.5097
Confusion Matrix:
[[925 353]
 [328 354]]


**Model Development and Evaluation (Neural Network)**

In [27]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.utils import class_weight

X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)

# Calculate class weights
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)

class_weight_dict = {0: class_weights[0], 1: class_weights[1]}

# Build the neural network model
model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model with class weights
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1, class_weight=class_weight_dict)

# Predict on the test set
y_pred = (model.predict(X_test) > 0.5).astype("int32")

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Confusion Matrix:")
print(conf_matrix)


Epoch 1/10
221/221 ━━━━━━━━━━━━━━━━━━━━ 20s 74ms/step - accuracy: 0.6659 - loss: 0.6521 - val_accuracy: 0.6594 - val_loss: 0.6117
Epoch 2/10
221/221 ━━━━━━━━━━━━━━━━━━━━ 15s 66ms/step - accuracy: 0.8170 - loss: 0.4063 - val_accuracy: 0.6735 - val_loss: 0.6337
Epoch 3/10
221/221 ━━━━━━━━━━━━━━━━━━━━ 19s 87ms/step - accuracy: 0.9494 - loss: 0.1450 - val_accuracy: 0.6824 - val_loss: 0.8908
Epoch 4/10
221/221 ━━━━━━━━━━━━━━━━━━━━ 16s 71ms/step - accuracy: 0.9940 - loss: 0.0293 - val_accuracy: 0.6811 - val_loss: 1.1955
Epoch 5/10
221/221 ━━━━━━━━━━━━━━━━━━━━ 14s 61ms/step - accuracy: 0.9998 - loss: 0.0052 - val_accuracy: 0.6913 - val_loss: 1.4065
Epoch 6/10
221/221 ━━━━━━━━━━━━━━━━━━━━ 13s 58ms/step - accuracy: 1.0000 - loss: 0.0015 - val_accuracy: 0.6862 - val_loss: 1.5053
Epoch 7/10
[output truncated]

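In the training log above, training accuracy climbs to 1.0 while the validation loss rises from 0.6117 to 1.5053, which suggests the network overfits after the first epoch. A common remedy is early stopping: halt training once the validation loss has not improved for a set number of epochs. The sketch below applies that rule to the logged values in plain Python. The function name and the patience of 2 are assumptions for illustration, not settings used in the notebook:

```python
# val_loss per epoch, copied from the six complete epochs in the log above.
val_losses = [0.6117, 0.6337, 0.8908, 1.1955, 1.4065, 1.5053]


def early_stopping_point(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)


best_epoch, stop_epoch = early_stopping_point(val_losses, patience=2)
print(best_epoch, stop_epoch)  # 1 3
```

Keras offers this behaviour through its EarlyStopping callback, which can monitor val_loss and restore the weights from the best epoch.
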
**Interpretation of Best Model (Random Forest)**

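Among the models with reported results, the Random Forest model using class weights produced the highest accuracy and precision, although Logistic Regression had a higher F1 score (0.5135) and the SVM a higher recall (0.5191). Its metrics are: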

**Accuracy**: 0.6852
This indicates that approximately 68.52% of the predictions made by the model were correct. For context, always predicting "Not Viable" would score about 65.2% on this test set (1278 of 1960 entities), so the gain over that baseline is modest.

**Precision**: 0.5895
This precision score means that when the model predicted an entity as "Viable" (class 1), it was correct about 58.95% of the time. This indicates a moderate level of confidence in the positive predictions.

**Recall**: 0.3138
The recall score signifies that the model correctly identified only about 31.38% of all actual "Viable" entities (class 1). In other words, it missed 468 of the 682 viable entities in the test set, which is a critical concern.

**F1 Score**: 0.4096
The F1 Score, which balances precision and recall, is 0.4096. This value reflects the trade-off between precision and recall, highlighting that while the model has reasonable precision, its ability to recall viable instances is quite low.

**Confusion Matrix**:
- **True Negatives (TN)**: 1129 – The model correctly identified 1129 entities as "Not Viable" (class 0).
- **False Positives (FP)**: 149 – The model incorrectly predicted 149 entities as "Viable" (class 1) when they were actually "Not Viable."
- **False Negatives (FN)**: 468 – The model failed to identify 468 actual "Viable" entities, predicting them as "Not Viable."
- **True Positives (TP)**: 214 – The model correctly identified 214 entities as "Viable."
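
As a check on the figures above, each metric can be recomputed directly from the four confusion-matrix counts using the standard definitions:

```python
# Counts from the Random Forest confusion matrix above.
tn, fp, fn, tp = 1129, 149, 468, 214

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.6852
print(f"Precision: {precision:.4f}")  # 0.5895
print(f"Recall:    {recall:.4f}")     # 0.3138
print(f"F1 Score:  {f1:.4f}")         # 0.4096
```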

**Summary**

Overall, the Random Forest model with weighted classes achieved the highest accuracy and precision of the models compared, but its recall of 0.3138 means it misses most viable entities. Improving recall, for example by further tuning the model or exploring different algorithms, is likely necessary to make its predictions useful.
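
One tuning option not tried in this notebook is lowering the decision threshold applied to the predicted probability of class 1, which trades precision for recall. The sketch below illustrates the idea on synthetic data generated with make_classification. It does not use the StartupBlink dataset, so the printed numbers say nothing about how the real model would behave:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the StartupBlink dataset).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.66, 0.34], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=42)
model.fit(X_train, y_train)

# Predicted probability of class 1 for each test row.
proba = model.predict_proba(X_test)[:, 1]

recalls = {}
for threshold in (0.5, 0.4, 0.3):
    y_pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={precision:.4f}  "
          f"recall={recalls[threshold]:.4f}")
```

Because a lower threshold can only add positive predictions, recall never decreases as the threshold drops; the cost is usually lower precision, so the threshold should be chosen on a validation set rather than on the test set.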