<h1 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
    Startup
</font>
</h1>

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    In this question, we aim to design a model that can predict whether a startup will succeed or not.
</font>
</p>


<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Introduction to the Dataset
</font>
</h2>

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    The starter files for this question are <code>train.csv</code> and <code>test.csv</code>, the training and test datasets, respectively.
    <br>
    The training dataset has 57615 rows and 12 columns (features), whose descriptions are provided in the table below.
</font>
</p>

<center>
<div align=center style="line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>

|      <b>Feature Name</b>       |                          <b>Feature Description</b>                           |
| :----------------------------: | :---------------------------------------------------------------------------: |
|       <code>name</code>        |                                 Company Name                                  |
|   <code>category_list</code>   |                           Company Business Category                           |
| <code>funding_total_usd</code> |                            Total Funding (in USD)                             |
|      <code>status</code>       | Company Status (The target variable, which you need to modify slightly later) |
|   <code>country_code</code>    |                                 Country Code                                  |
|    <code>state_code</code>     |                                  State Code                                   |
|      <code>region</code>       |                                    Region                                     |
|       <code>city</code>        |                                     City                                      |
|  <code>funding_rounds</code>   |                           Number of Funding Rounds                            |
|    <code>founded_at</code>     |                                 Date Founded                                  |
| <code>first_funding_at</code>  |                             Date of First Funding                             |
|  <code>last_funding_at</code>  |                             Date of Last Funding                              |

</font>
</div>
</center>

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    The test dataset has 8752 rows and the same columns as the training dataset, except that it lacks the <code>status</code> column.
</font>
</p>


<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Reading the Dataset
</font>
</h2>

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    First, import the libraries you need. Then, following the descriptions above, read the training and test datasets and apply the necessary preprocessing.
    <br>
    The <code>status</code> column takes four values: <code>operating</code>, <code>closed</code>, <code>acquired</code>, and <code>ipo</code>.
    We consider a company successful if its status is <code>acquired</code> or <code>ipo</code>. The <code>closed</code> status means the startup has failed and shut down, and the <code>operating</code> status means the company has not yet succeeded but has not gone bankrupt either.
    Therefore, your model should output one of three labels as its prediction: 0 (failed and shut down), 1 (not yet successful but still operating), or 2 (successful).
</font>
</p>
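<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    As a rough illustration of the label scheme described above, the sketch below encodes a few status strings with a plain dictionary and counts how many rows fall into each class. The sample values and the names <code>STATUS_TO_TARGET</code> and <code>encode_status</code> are hypothetical and are not part of the starter files.
</font>
</p>

```python
from collections import Counter

# Label scheme from the text above: closed -> 0, operating -> 1, acquired/ipo -> 2.
STATUS_TO_TARGET = {"closed": 0, "operating": 1, "acquired": 2, "ipo": 2}


def encode_status(statuses):
    """Map raw status strings to 0/1/2; unknown values become None."""
    return [STATUS_TO_TARGET.get(s) for s in statuses]


# Hypothetical sample values, not taken from train.csv.
sample = ["operating", "closed", "ipo", "acquired", "operating"]
encoded = encode_status(sample)
class_counts = Counter(encoded)

print(encoded)             # [1, 0, 2, 2, 1]
print(dict(class_counts))  # rows per class, useful for spotting imbalance
```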


In [1]:
import os
import warnings
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics import f1_score

warnings.filterwarnings("ignore")

In [2]:
# Reading/Loading the dataset files
train_path = os.path.join(os.getcwd(), 'train.csv')
test_path = os.path.join(os.getcwd(), 'test.csv')

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

# Parse dates and coerce numerics
date_cols = ['founded_at', 'first_funding_at', 'last_funding_at']
for c in date_cols:
    train_df[c] = pd.to_datetime(train_df[c], errors='coerce')
    test_df[c] = pd.to_datetime(test_df[c], errors='coerce')

for c in ['funding_total_usd', 'funding_rounds']:
    train_df[c] = pd.to_numeric(train_df[c], errors='coerce')
    test_df[c] = pd.to_numeric(test_df[c], errors='coerce')

# Map target variable: closed→0, operating→1, acquired/ipo→2
def map_status_to_target(s):
    if s == 'closed':
        return 0
    if s == 'operating':
        return 1
    if s in ('acquired', 'ipo'):
        return 2
    return np.nan

y = train_df['status'].map(map_status_to_target).astype('float').astype('Int64')

# Keep a copy of features
X_full = train_df.drop(columns=['status']).copy()
X_test_full = test_df.copy()

# Drop any rows with unknown target labels
idx = y.notna()
X_full = X_full.loc[idx].copy()
y = y.loc[idx].astype(int)

print('Train shape:', train_df.shape, 'Test shape:', test_df.shape)
print('Dropped rows with unknown status:', int((~idx).sum()))

Train shape: (57616, 12) Test shape: (8752, 11)
Dropped rows with unknown status: 0


In [3]:
# Preprocessing step
def prepare_features(df, city_keep=None, state_keep=None, is_train=False):
    out = df.copy()

    # Main category from category_list
    out['category_main'] = out['category_list'].astype(str).str.split('|').str[0].str.strip()
    out.loc[out['category_main'].isna() | (out['category_main'] == '') | (out['category_main'].str.lower() == 'nan'), 'category_main'] = 'Unknown'

    # Category count
    cat_series = out['category_list'].astype(str)
    valid_cat = (cat_series.str.lower() != 'nan') & (cat_series.str.strip() != '')
    out['category_count'] = 0
    out.loc[valid_cat, 'category_count'] = cat_series[valid_cat].str.count(r'\|') + 1  # raw string avoids an invalid-escape warning

    # Date-derived features
    out['days_to_first_funding'] = (out['first_funding_at'] - out['founded_at']).dt.days
    out['days_funding_span'] = (out['last_funding_at'] - out['first_funding_at']).dt.days
    out['age_at_last_funding'] = (out['last_funding_at'] - out['founded_at']).dt.days

    out['founded_year'] = out['founded_at'].dt.year
    out['first_funding_year'] = out['first_funding_at'].dt.year
    out['last_funding_year'] = out['last_funding_at'].dt.year

    out['founded_month'] = out['founded_at'].dt.month
    out['first_funding_month'] = out['first_funding_at'].dt.month
    out['last_funding_month'] = out['last_funding_at'].dt.month

    # Log transform funding and simple indicator
    out['log_funding_total_usd'] = np.log1p(out['funding_total_usd'])
    out['has_funding'] = (out['funding_rounds'] > 0).astype(int)

    # Reduce high-cardinality city and state_code using frequency thresholds
    if is_train:
        city_counts = out['city'].astype(str).value_counts()
        city_keep = set(city_counts[city_counts >= 50].index)
        state_counts = out['state_code'].astype(str).value_counts()
        state_keep = set(state_counts[state_counts >= 100].index)

    out['city_reduced'] = out['city'].astype(str).where(out['city'].astype(str).isin(city_keep), other='Other')
    out['state_code_reduced'] = out['state_code'].astype(str).where(out['state_code'].astype(str).isin(state_keep), other='Other')

    if is_train:
        return out, city_keep, state_keep
    return out

# Prepare train and test feature DataFrames
X_prepared, city_keep_set, state_keep_set = prepare_features(X_full, is_train=True)
X_test_prepared = prepare_features(X_test_full, city_keep=city_keep_set, state_keep=state_keep_set, is_train=False)

# Define feature lists
numeric_features = [
    'funding_total_usd', 'funding_rounds', 'log_funding_total_usd',
    'days_to_first_funding', 'days_funding_span', 'age_at_last_funding',
    'founded_year', 'first_funding_year', 'last_funding_year',
    'founded_month', 'first_funding_month', 'last_funding_month',
    'category_count', 'has_funding'
]
categorical_features = ['country_code', 'state_code_reduced', 'region', 'city_reduced', 'category_main']

feature_cols = numeric_features + categorical_features

# Final X matrices
X = X_prepared[feature_cols].copy()
X_test_final = X_test_prepared[feature_cols].copy()
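
<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    The frequency-threshold step above can be hard to follow inside the larger function, so here is a minimal standalone sketch of the same idea: the set of categories to keep is learned from the training data only and then reused unchanged on the test data. The helper names <code>fit_keep_set</code> and <code>reduce_rare</code> and the toy city lists are assumptions made for illustration.
</font>
</p>

```python
from collections import Counter


def fit_keep_set(values, min_count):
    """Return the categories that appear at least min_count times."""
    counts = Counter(values)
    return {v for v, n in counts.items() if n >= min_count}


def reduce_rare(values, keep, other="Other"):
    """Replace every value outside keep with a shared placeholder."""
    return [v if v in keep else other for v in values]


# Hypothetical toy data; the notebook uses thresholds of 50 (city) and 100 (state_code).
train_cities = ["Tehran", "Tehran", "Berlin", "Tehran", "Oslo", "Berlin"]
test_cities = ["Berlin", "Lima", "Tehran"]

keep = fit_keep_set(train_cities, min_count=2)  # learned from training data only
train_reduced = reduce_rare(train_cities, keep)
test_reduced = reduce_rare(test_cities, keep)   # reuses the training keep-set

print(sorted(keep))
print(test_reduced)
```

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    Reusing the training keep-set means a city seen only in the test data falls into <code>Other</code> rather than creating a category the model never saw during training.
</font>
</p>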

<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Model Training
</font>
</h2>

<p style="text-align: justify;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>
    Now that you have cleaned the data, it's time to train a model that can predict the target variable of this problem.
</font>
</p>


In [4]:
# Model design
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessors
numeric_transformer_lr = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer_ohe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor_lr = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer_lr, numeric_features),
        ('cat', categorical_transformer_ohe, categorical_features),
    ],
    remainder='drop'
)

numeric_transformer_tree = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer_ord = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])

preprocessor_tree = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer_tree, numeric_features),
        ('cat', categorical_transformer_ord, categorical_features),
    ],
    remainder='drop'
)

# HistGradientBoostingClassifier is not in the imports cell, so import it here;
# without it this cell stops with a NameError.
from sklearn.ensemble import HistGradientBoostingClassifier

# Candidate models
models = {
    'logreg': Pipeline(steps=[('preprocessor', preprocessor_lr),
                              ('clf', LogisticRegression(solver='saga', penalty='l2', max_iter=4000,
                                                         class_weight='balanced', multi_class='ovr'))]),
    'linsvc': Pipeline(steps=[('preprocessor', preprocessor_lr),
                              ('clf', LinearSVC(C=1.0, class_weight='balanced'))]),
    'rf': Pipeline(steps=[('preprocessor', preprocessor_tree),
                          ('clf', RandomForestClassifier(n_estimators=500, max_depth=None,
                                                         min_samples_leaf=2, n_jobs=-1,
                                                         class_weight='balanced_subsample', random_state=42))]),
    'hgb': Pipeline(steps=[('preprocessor', preprocessor_tree),
                           ('clf', HistGradientBoostingClassifier(max_depth=8, learning_rate=0.1,
                                                                 max_iter=500, random_state=42))])
}

# Fit every candidate on the training split and keep the one with the
# highest macro F1 on the validation split.
scores = {}
best_model = None
best_model_name = None
best_score = -1.0

for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_valid)
    score = f1_score(y_valid, preds, average='macro')
    scores[name] = score
    print(f"Model {name} macro F1: {score:.4f}")
    if score > best_score:
        best_score = score
        best_model = pipe
        best_model_name = name

print(f"Selected best model: {best_model_name} with macro F1 = {best_score:.4f}")

# Use the best model for subsequent evaluation and submission
model = best_model

<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Evaluation Metric
</font>
</h2>

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    Model performance is evaluated with <code>f1_score</code> using <code>macro</code> averaging.
    <br>
    The grading system uses this same metric to score your submission.
    <br>
    We suggest evaluating your model on the training or validation set with this metric.
</font>
</p>
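<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    To make the macro averaging concrete, the sketch below computes a per-class F1 score and then takes the unweighted mean over the three labels. It is a plain-Python approximation for intuition only, assuming the fixed label set 0, 1, 2; the notebook itself relies on <code>sklearn.metrics.f1_score</code> with <code>average='macro'</code>. The function name <code>macro_f1</code> and the label lists are hypothetical.
</font>
</p>

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples contributes 0 here.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical labels for illustration only.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 2]
score = macro_f1(y_true, y_pred)

# Predicting a single class for every row scores poorly under macro averaging.
constant_score = macro_f1(y_true, [2] * len(y_true))

print(round(score, 4))           # 0.6556
print(round(constant_score, 4))  # 0.1667
```

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    Because every class carries the same weight, a model that ignores the smaller classes is penalized even if it labels most rows correctly.
</font>
</p>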

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font color="red"><b color='red'>Attention:</b></font>
<font face="vazir" size=3>
    To receive a score for this question, your model's accuracy must be greater than the threshold of 0.4.
    If your model's accuracy is less than 0.4, your score will be 
    <b>zero</b>
    , otherwise, it will be calculated using the following formula:
</font>
</p>


In [None]:
# evaluate your model
y_valid_pred = model.predict(X_valid)
valid_f1 = f1_score(y_valid, y_valid_pred, average='macro')
print('Validation macro F1 (best model):', round(valid_f1, 4))

Validation macro F1: 0.0637


<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
 Prediction on Test Data and Output
</font>
</h2>

<p style="text-align: justify;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>
    Save your model's predictions on the test data in a dataframe in the following format.
</font>
</p>

<p style="text-align: justify;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>
    Note that the dataframe name must be <code>submission</code>; otherwise, the grading system will not be able to evaluate your output.
    This dataframe only includes one column named <code>status</code> and has 8752 rows.
    <br>
    For each row in the test dataset, you must have a predicted value, which is your model's predicted <code>status</code> value.
    For example, the table below shows the first 5 rows of the <code>submission</code> dataframe. These values are hypothetical and may be different in your answer.
</font>
</p>

<center>
<div align=center 
style="direction: ltr;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>
    
||<code>status</code>|
|:----:|:-----:|
|0|1|
|1|2|
|2|1|
|3|1|
|4|0|

</font>
</div>
</center>
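<p style="text-align: justify;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>
    Before submitting, it may help to check the output against the format described above. The sketch below validates the text of a CSV file that has a single <code>status</code> column, such as the <code>submission.csv</code> written by the answer builder cell. The function name <code>validate_submission</code> and the sample strings are hypothetical; for the real file, <code>expected_rows</code> would be 8752.
</font>
</p>

```python
import csv
import io


def validate_submission(csv_text, expected_rows, allowed=(0, 1, 2)):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    problems = []
    if header != ["status"]:
        problems.append(f"unexpected header: {header}")
    rows = [row for row in reader if row]
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    allowed_text = {str(v) for v in allowed}
    bad = [row for row in rows if len(row) != 1 or row[0] not in allowed_text]
    if bad:
        problems.append(f"{len(bad)} rows with invalid values")
    return problems


# Hypothetical five-row file matching the example table above.
good_csv = "status\n1\n2\n1\n1\n0\n"
bad_csv = "status\n1\n3\n"

good_problems = validate_submission(good_csv, expected_rows=5)
bad_problems = validate_submission(bad_csv, expected_rows=5)

print(good_problems)  # []
print(bad_problems)
```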


In [None]:
# Generate predictions on test data and build the submission.
# Re-run model selection on the same 80/20 stratified split used above,
# then refit the winning pipeline on the full training data.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

candidates = {
    'logreg': Pipeline(steps=[('preprocessor', preprocessor_lr),
                              ('clf', LogisticRegression(solver='saga', penalty='l2', max_iter=4000,
                                                         class_weight='balanced', multi_class='ovr'))]),
    'linsvc': Pipeline(steps=[('preprocessor', preprocessor_lr),
                              ('clf', LinearSVC(C=1.0, class_weight='balanced'))]),
    'rf': Pipeline(steps=[('preprocessor', preprocessor_tree),
                          ('clf', RandomForestClassifier(n_estimators=500, max_depth=None,
                                                         min_samples_leaf=2, n_jobs=-1,
                                                         class_weight='balanced_subsample', random_state=42))]),
    'hgb': Pipeline(steps=[('preprocessor', preprocessor_tree),
                           ('clf', HistGradientBoostingClassifier(max_depth=8, learning_rate=0.1,
                                                                 max_iter=500, random_state=42))])
}

best_name, best_pipe, best_val = None, None, -1.0
for name, pipe in candidates.items():
    pipe.fit(X_tr, y_tr)
    pred = pipe.predict(X_va)
    sc = f1_score(y_va, pred, average='macro')
    if sc > best_val:
        best_name, best_pipe, best_val = name, pipe, sc
print(f'Using {best_name} for final training (val macro F1 {best_val:.4f})')

best_pipe.fit(X, y)

y_test_pred = best_pipe.predict(X_test_final)
submission = pd.DataFrame({'status': y_test_pred.astype(int)})
print('Submission shape:', submission.shape)
submission.head()

Submission shape: (8752, 1)


   status
0       2
1       2
2       2
3       2
4       2


<h2 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
<b>Answer Builder Cell</b>
</font>
</h2>

<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    Run the cell below to create the <code>result.zip</code> file. Note that before running the cell below, you must have saved the changes made in the notebook (<code>ctrl+s</code>); otherwise, your score will be changed to zero at the end of the competition.
    <br>
    Also, if you are using Colab to run this notebook file, before submitting the <code>result.zip</code> file, download the latest version of your notebook and place it inside the submission file.
</font>
</p>


In [None]:
import zipfile

if not os.path.exists(os.path.join(os.getcwd(), 'startup.ipynb')):
    %notebook -e startup.ipynb

def compress(file_names):
    print("File Paths:")
    print(file_names)
    compression = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile("result.zip", mode="w") as zf:
        for file_name in file_names:
            zf.write('./' + file_name, file_name, compress_type=compression)

submission.to_csv('submission.csv', index=False)

file_names = ['startup.ipynb', 'submission.csv']
compress(file_names)

File Paths:
['startup.ipynb', 'submission.csv']
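
<p style="text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
    As an optional sanity check, the archive can be reopened to confirm that both expected files are inside and pass the integrity test. The sketch below demonstrates this on a throwaway archive with placeholder contents; the helper name <code>zip_members</code> is hypothetical. To check the real archive, call it with the path to <code>result.zip</code>.
</font>
</p>

```python
import os
import tempfile
import zipfile


def zip_members(zip_path):
    """Return the file names stored in the archive after an integrity check."""
    with zipfile.ZipFile(zip_path) as zf:
        corrupt = zf.testzip()  # None when every member passes its CRC check
        if corrupt is not None:
            raise ValueError(f"corrupt member: {corrupt}")
        return zf.namelist()


# Build a throwaway archive with placeholder contents to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "result.zip")
    with zipfile.ZipFile(demo_zip, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("startup.ipynb", "{}")
        zf.writestr("submission.csv", "status\n1\n")
    members = zip_members(demo_zip)

# Any expected file that is absent from the archive shows up here.
missing = {"startup.ipynb", "submission.csv"} - set(members)
print(members, missing)
```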
