# Task
Generate Python code to build and evaluate a RandomForest classification model to predict user preference for pickup & delivery service from the survey data in the file "/content/Tailorswift response - Form responses 1.csv". The code should include data loading, cleaning, label encoding, data splitting, model training, evaluation, and feature importance analysis, presented in a Google Colab-compatible format.

## Load the data

### Subtask:
Load the CSV file into a pandas DataFrame.


**Reasoning**:
Import pandas, load the CSV file into a DataFrame, display the first 5 rows, and show column names and data types.



In [2]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
df = pd.read_csv("/content/Tailorswift response - Form responses 1.csv")

# Display the first 5 rows of the DataFrame
display(df.head())

# Display the column names and their data types
display(df.info())

Unnamed: 0,Timestamp,Q1. Name,Q2. Your Age,Q3. Gender,Q4. Location City,Q5. Do you prefer stitched or ready-made?\nक्या आप सिले हुए कपड़े पसंद करते हैं या रेडीमेड?,Q6. How often do you visit for clothes stitching per month?\nआप प्रति माह कपड़े सिलवाने के लिए कितनी बार दर्ज़ी के पास जाते हैं?,Q7. What kind of clothes do you mostly stitch?\nआप आमतौर पर किस प्रकार के कपड़े सिलवाते हैं?,Q8. Who wears stitched clothes regularly?\nआपके परिवार में कौन सिले हुए कपड़े पहनता है?,Q9. Who is your tailor?\nआपका दर्ज़ी कौन है?,Q10. Which problems did you face with tailors?\nआपको दर्ज़ी से जुड़ी कौन-कौन सी समस्याओं का सामना करना पड़ा है?,Q11. How many times do you have to visit for changes in order?\nएक ऑर्डर में बदलाव के लिए आपको कितनी बार दर्ज़ी के पास जाना पड़ता है?,Q12. Who usually goes to the tailor in your house?\nआपके घर में आमतौर पर दर्ज़ी के पास कौन जाता है?,"Q13. Would you like us to visit your place for measurement/stitching advice/ consultancy?\nक्या आप चाहेंगे कि हम माप लेने, सिलाई की सलाह या कंसल्टेंसी के लिए आपके घर आएं?",Q14. Would you like pickup & delivery instead of visiting the tailor?\nक्या आप दर्ज़ी के पास जाने की बजाय कपड़ों की पिकअप और डिलीवरी सेवा पसंद करेंगे?,Q15. What is the most important thing in tailoring?\nटेलरिंग में आपके लिए सबसे महत्वपूर्ण चीज़ क्या है?\n(A) Most Preferred,(B) Second most preferred,(C) Least Preferred
0,12/07/2025 14:08:38,Arshdeep Singh,15-25,Male,"Punjab, Bathinda",both,0-3 times,"Pants, Kurta/Pajama, Shirt","Self, Kids, Parents, All of above","Nearby local tailor, Boutique",Tailor attitude,Once,The person who needs to stitch,No,Yes,Perfect fitting,Fast Delivery,Low cost
1,12/07/2025 14:26:29,Prince Singh,15-25,Male,Gorakhpur,Ready-made,0-3 times,Pants,Parents,Nearby local tailor,"Visit multiple times, Expensive",2 - 3 times,"The person who needs to stitch, Self",Yes,Yes,Perfect fitting,Fast Delivery,Low cost
2,12/07/2025 15:30:04,Janvi,15-25,Female,"aurangabad, maharashtra",Ready-made,0-3 times,Alterations,Occasionlly/events or work releated wearer,"Family Tailor, Nearby local tailor, At home","Late delivery, Bad fitting, Visit multiple times",Once,dad,Yes,Yes,Perfect fitting,Fast Delivery,Low cost
3,12/07/2025 15:41:53,Janvi Muummy,31-45,Female,"Aurangabad, maharashtra",Ready-made,0-3 times,"Blouse, Alterations, mostly alteration work",Occasionlly/events or work releated wearer,"At home, Nearby local tailor, Stitch from wher...","Late delivery, Bad fitting, Visit multiple tim...",2 - 3 times,husband,Yes,Yes,Fast delivery,Perfect Fitting,Low cost
4,12/07/2025 15:43:11,Parneet Kaur,15-25,Female,Bathinda,both,3-5 times,"Suit, Kurta/Pajama","Self, Kids, Parents, All of above, Grand Paren...","Nearby local tailor, Boutique",Late delivery,Once,"Self, mom",Maybe,Yes,Fast delivery,Perfect Fitting,Low cost


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 18 columns):
 #   Column                                                                                                                                                                     Non-Null Count  Dtype 
---  ------                                                                                                                                                                     --------------  ----- 
 0   Timestamp                                                                                                                                                                  68 non-null     object
 1   Q1. Name                                                                                                                                                                   65 non-null     object
 2   Q2. Your Age                                                                                    

None

## Clean the data

### Subtask:
Remove unnecessary columns and handle any missing values.


**Reasoning**:
Drop irrelevant columns and check for missing values.



In [3]:
# Drop irrelevant columns
df_cleaned = df.drop(columns=['Timestamp', 'Q1. Name', 'Q4. Location City'])

# Check for missing values in the remaining columns
missing_values = df_cleaned.isnull().sum()
print("Missing values before handling:\n", missing_values)

# Handle missing values by dropping rows with any missing values
df_cleaned = df_cleaned.dropna()

# Display the number of remaining rows and columns after cleaning
print("\nShape of the DataFrame after cleaning:", df_cleaned.shape)

# Display missing values after handling
print("\nMissing values after handling:\n", df_cleaned.isnull().sum())

Missing values before handling:
 Q2. Your Age                                                                                                                                                                  0
Q3. Gender                                                                                                                                                                    0
Q5. Do you prefer stitched or ready-made?\nक्या आप सिले हुए कपड़े पसंद करते हैं या रेडीमेड?                                                                                   0
Q6. How often do you visit for clothes stitching per month?\nआप प्रति माह कपड़े सिलवाने के लिए कितनी बार दर्ज़ी के पास जाते हैं?                                              0
Q7. What kind of clothes do you mostly stitch?\nआप आमतौर पर किस प्रकार के कपड़े सिलवाते हैं?                                                                                  0
Q8. Who wears stitched clothes regularly?\nआपके परिवार में कौन सिले हुए कपड़े पहनता है?

## Preprocess the data

### Subtask:
Apply label encoding to categorical features and separate features and the target variable.


**Reasoning**:
Apply label encoding to the categorical features and separate the features (X) and the target variable (y).



In [4]:
from sklearn.preprocessing import LabelEncoder

# Identify categorical columns (all columns in df_cleaned are object type)
categorical_cols = df_cleaned.columns

# Keep one fitted LabelEncoder per column so the integer codes can be mapped back to labels later
label_encoders = {}

# Apply label encoding to each categorical column
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    df_cleaned[col] = label_encoders[col].fit_transform(df_cleaned[col])

# Define the target variable (y)
target_column = 'Q14. Would you like pickup & delivery instead of visiting the tailor?\nक्या आप दर्ज़ी के पास जाने की बजाय कपड़ों की पिकअप और डिलीवरी सेवा पसंद करेंगे?'
y = df_cleaned[target_column]

# Define the features (X) by dropping the target column
X = df_cleaned.drop(columns=[target_column])

# Display the first few rows of X and y to verify
display(X.head())
display(y.head())

Unnamed: 0,Q2. Your Age,Q3. Gender,Q5. Do you prefer stitched or ready-made?\nक्या आप सिले हुए कपड़े पसंद करते हैं या रेडीमेड?,Q6. How often do you visit for clothes stitching per month?\nआप प्रति माह कपड़े सिलवाने के लिए कितनी बार दर्ज़ी के पास जाते हैं?,Q7. What kind of clothes do you mostly stitch?\nआप आमतौर पर किस प्रकार के कपड़े सिलवाते हैं?,Q8. Who wears stitched clothes regularly?\nआपके परिवार में कौन सिले हुए कपड़े पहनता है?,Q9. Who is your tailor?\nआपका दर्ज़ी कौन है?,Q10. Which problems did you face with tailors?\nआपको दर्ज़ी से जुड़ी कौन-कौन सी समस्याओं का सामना करना पड़ा है?,Q11. How many times do you have to visit for changes in order?\nएक ऑर्डर में बदलाव के लिए आपको कितनी बार दर्ज़ी के पास जाना पड़ता है?,Q12. Who usually goes to the tailor in your house?\nआपके घर में आमतौर पर दर्ज़ी के पास कौन जाता है?,"Q13. Would you like us to visit your place for measurement/stitching advice/ consultancy?\nक्या आप चाहेंगे कि हम माप लेने, सिलाई की सलाह या कंसल्टेंसी के लिए आपके घर आएं?",Q15. What is the most important thing in tailoring?\nटेलरिंग में आपके लिए सबसे महत्वपूर्ण चीज़ क्या है?\n(A) Most Preferred,(B) Second most preferred,(C) Least Preferred
0,0,1,2,0,21,12,15,14,2,5,1,2,0,1
1,0,1,0,0,19,5,14,16,0,6,2,2,0,1
2,0,0,0,0,0,4,11,7,2,9,2,2,0,1
3,2,0,0,0,4,4,5,10,0,10,2,0,2,1
4,0,0,2,1,31,13,15,3,2,3,0,0,2,1


Unnamed: 0,Q14. Would you like pickup & delivery instead of visiting the tailor?\nक्या आप दर्ज़ी के पास जाने की बजाय कपड़ों की पिकअप और डिलीवरी सेवा पसंद करेंगे?
0,2
1,2
2,2
3,2
4,2
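
The integer codes above are hard to read without the mapping back to the original answers. The following is a minimal sketch on a hypothetical toy frame (the column names `age` and `answer` and all values are made up for illustration). It shows that `LabelEncoder` assigns codes in sorted label order and that keeping one encoder per column allows decoding with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for two survey columns
toy = pd.DataFrame({
    "age": ["15-25", "31-45", "15-25", "26-30"],
    "answer": ["Yes", "No", "Maybe", "Yes"],
})

# Fit a separate encoder for every column and keep it for later decoding
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the labels in the order of their integer codes
answer_classes = list(encoders["answer"].classes_)
decoded = encoders["answer"].inverse_transform(encoded["answer"])

print(answer_classes)
print(list(encoded["answer"]))
print(list(decoded))
```

If the target column holds exactly the answers "Maybe", "No", and "Yes" (an assumption, since the full column is not shown here), the codes 0, 1, and 2 would map to those labels in that order.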


## Split the data

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Split the feature and target data into training and testing sets using train_test_split.



In [5]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets to verify the split
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (54, 14)
Shape of X_test: (14, 14)
Shape of y_train: (54,)
Shape of y_test: (14,)
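
With an imbalanced target and only 14 test rows, a plain random split can leave a class with one or zero test samples. A possible alternative is a stratified split, sketched below on a hypothetical imbalanced target (the arrays `X_toy` and `y_toy` are invented for illustration and are not the survey data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 40 samples of class 2, 10 each of classes 0 and 1
y_toy = np.array([2] * 40 + [0] * 10 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Count how many test samples each class received
test_counts = np.bincount(y_te, minlength=3)
print(test_counts)
```

Stratification requires at least two samples per class, so very rare answers in the survey may need to be merged or dropped first.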


## Train the model

### Subtask:
Train a RandomForestClassifier model on the training data.


**Reasoning**:
Import the necessary library and train the RandomForestClassifier model on the training data.



In [6]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate a RandomForestClassifier object
model = RandomForestClassifier(random_state=42)

# Fit the model to the training data
model.fit(X_train, y_train)

## Evaluate the model

### Subtask:
Evaluate the model's performance using appropriate metrics like accuracy or a classification report.


**Reasoning**:
Evaluate the trained model using the classification report.



In [7]:
from sklearn.metrics import classification_report

# Make predictions on the test set
y_pred = model.predict(X_test)

# Generate and print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.25      0.50      0.33         2
           1       0.00      0.00      0.00         1
           2       0.80      0.73      0.76        11

    accuracy                           0.64        14
   macro avg       0.35      0.41      0.37        14
weighted avg       0.66      0.64      0.65        14



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
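
The warning lines above appear when a class is never predicted, which makes its precision undefined. A minimal sketch with hypothetical labels (`y_true_toy` and `y_pred_toy` are invented, not the notebook's test set) shows how a confusion matrix exposes this and how the `zero_division` parameter sets the fallback value explicitly.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels in which class 1 is never predicted
y_true_toy = [0, 0, 1, 2, 2, 2, 2, 2]
y_pred_toy = [0, 2, 2, 2, 2, 2, 0, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])
print(cm)

# zero_division=0 reports 0.0 for undefined precision without emitting a warning
report = classification_report(
    y_true_toy, y_pred_toy, labels=[0, 1, 2], zero_division=0, output_dict=True
)
print(report["1"])
```

An all-zero column in the confusion matrix identifies the class the model never predicts.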


## Analyze feature importance

### Subtask:
Determine the importance of each feature in predicting the target variable.


**Reasoning**:
Determine the importance of each feature in predicting the target variable by accessing the feature_importances_ attribute, creating a pandas Series, sorting it, and printing the result.



In [8]:
import pandas as pd

# Access the feature_importances_ attribute of the trained model
feature_importances = model.feature_importances_

# Create a pandas Series from the feature importances, using the column names of X_train as the index
feature_importance_series = pd.Series(feature_importances, index=X_train.columns)

# Sort the feature importance Series in descending order
sorted_feature_importances = feature_importance_series.sort_values(ascending=False)

# Print the sorted feature importances
print("Feature Importances (Sorted):")
print(sorted_feature_importances)

Feature Importances (Sorted):
Q13. Would you like us to visit your place for measurement/stitching advice/ consultancy?\nक्या आप चाहेंगे कि हम माप लेने, सिलाई की सलाह या कंसल्टेंसी के लिए आपके घर आएं?    0.228651
Q12. Who usually goes to the tailor in your house?\nआपके घर में आमतौर पर दर्ज़ी के पास कौन जाता है?                                                                           0.127845
Q9. Who is your tailor?\nआपका दर्ज़ी कौन है?                                                                                                                                  0.107963
Q8. Who wears stitched clothes regularly?\nआपके परिवार में कौन सिले हुए कपड़े पहनता है?                                                                                       0.103356
Q10. Which problems did you face with tailors?\nआपको दर्ज़ी से जुड़ी कौन-कौन सी समस्याओं का सामना करना पड़ा है?                                                               0.099688
Q2. Your Age                                           
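
The `feature_importances_` values are impurity-based and computed on the training data, which tends to favor features with many distinct values. One commonly used cross-check is permutation importance on held-out data. The sketch below uses synthetic data from `make_classification` as a stand-in; the dataset, sizes, and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of them informative
X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in accuracy
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

# Feature indices ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
print(ranking)
print(result.importances_mean.round(3))
```

On a test set as small as the 14 rows used here, permutation scores would be noisy as well, so they are best read as a sanity check rather than a definitive ranking.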

## Summary:

### Data Analysis Key Findings

*   The dataset initially contained 18 columns and 68 entries.
*   After cleaning, which involved removing 'Timestamp', 'Q1. Name', and 'Q4. Location City', and handling missing values (though none were found), the dataset retained 15 columns and 68 rows.
*   All features were successfully label encoded to numerical values.
*   The data was split into training (54 samples) and testing (14 samples) sets, with 14 features each.
*   A RandomForestClassifier model was trained on the training data.
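*   The model achieved an accuracy of 0.64 on the test set (9 of 14 correct), which is below the 0.79 that always predicting the majority class (11 of 14 test samples) would score.
*   The classification report showed that class 1 was never predicted on the test set (precision and recall of 0.00 on a single test sample), which triggers an `UndefinedMetricWarning`. The per-class scores rest on only 2, 1, and 11 test samples.
*   The feature importance analysis ranked Q13 (home visit for measurement or consultancy) first at about 0.23, followed by Q12 (who goes to the tailor, 0.13) and Q9 (who the tailor is, 0.11).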

### Insights or Next Steps

*   Investigate the class imbalance in the target variable and consider resampling techniques to improve model performance, especially on minority classes.
*   Given the small dataset size, consider cross-validation for more robust model evaluation and potentially exploring simpler models or collecting more data.
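
The two suggestions above can be combined in one short experiment. The following is a sketch under stated assumptions: it uses synthetic imbalanced data from `make_classification` instead of the survey, weights classes with `class_weight="balanced"` as one alternative to resampling, and scores with macro F1 so minority classes count equally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a small, imbalanced three-class survey target
X_syn, y_syn = make_classification(
    n_samples=90, n_features=8, n_informative=4, n_classes=3,
    weights=[0.2, 0.15, 0.65], random_state=42
)

# class_weight="balanced" reweights samples inversely to class frequency
clf = RandomForestClassifier(class_weight="balanced", random_state=42)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_syn, y_syn, cv=cv, scoring="f1_macro")

print(scores)
print(scores.mean())
```

The spread of the fold scores gives a rough sense of how much a single 14-row test split can mislead.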


## Save the model

### Subtask:
Save the trained model to a file.

**Reasoning**:
Save the trained model to a file using pickle or joblib.

In [11]:
import joblib

# Define the filename for the saved model
model_filename = 'Tailor.joblib'

# Save the trained model to the file
joblib.dump(model, model_filename)

print(f"Model saved to {model_filename}")

Model saved to Tailor.joblib


## Explain the model

### Subtask:
Explain how the RandomForest model works in simple terms.

**Reasoning**:
Provide a simple explanation of the RandomForest model.

A RandomForest model is like having a team of decision-makers (decision trees) working together. Each decision tree makes its own prediction based on the data it sees. The RandomForest then combines the predictions from all these individual trees to make a final, more accurate prediction.

Think of it like asking many different people for their opinion on something and then going with the majority opinion. In this case, the "opinions" are the predictions from each decision tree, and the "majority opinion" is the final prediction of the RandomForest. This combination of multiple trees helps to reduce errors and improve the overall performance of the model.
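
The "majority opinion" picture can be checked directly. In scikit-learn, the forest averages the class probabilities of its trees rather than counting hard votes, and the sketch below reproduces that by hand on synthetic data (the dataset and variable names are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_syn, y_syn)

# Collect the class probabilities predicted by every individual tree
tree_probas = np.stack([tree.predict_proba(X_syn) for tree in forest.estimators_])

# Average over trees, then pick the class with the highest mean probability
avg_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

print(np.allclose(avg_proba, forest.predict_proba(X_syn)))
print(np.array_equal(manual_pred, forest.predict(X_syn)))
```

Both checks should print `True`, which confirms that the forest's prediction is the averaged opinion of its trees.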

## Load the model and make predictions

### Subtask:
Load the saved model and use it to make predictions on the test set.

**Reasoning**:
Load the saved model using `joblib.load` and then use the loaded model to make predictions on the `X_test` data. Display the first few predictions.

In [13]:
import joblib

# Define the filename of the saved model
model_filename = 'Tailor.joblib'

# Load the trained model from the file
loaded_model = joblib.load(model_filename)

# Use the loaded model to make predictions on the test set
loaded_model_predictions = loaded_model.predict(X_test)

# Display the first few predictions
print("First 10 predictions from the loaded model:")
print(loaded_model_predictions[:10])

# You can compare these predictions to the actual values in y_test
print("\nFirst 10 actual values from y_test:")
print(y_test[:10].values)

First 10 predictions from the loaded model:
[0 0 2 2 0 2 2 2 0 2]

First 10 actual values from y_test:
[0 2 2 2 2 0 2 2 2 2]
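
A quick way to confirm that saving and loading did not change the model is to compare predictions before and after the round trip. This is a minimal sketch on synthetic data with a temporary file; the file name `toy_model.joblib` and the data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model
X_syn, y_syn = make_classification(n_samples=60, n_features=5, random_state=1)
clf = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_syn, y_syn)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(clf.predict(X_syn), restored.predict(X_syn))
print(same_predictions)
```

A loaded model expects the same columns, in the same order and with the same encoding, as the data it was trained on.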


In [15]:
# Create a simple CSV-style dataset of tailoring startups with sample features for modeling or extension
startup_data = {
    "Startup Name": [
        "Tailor 24", "YourTailor.in", "Tech-Tailor", "TailoreMade",
        "TailorMe", "Tailor Smart", "Darzi", "Darzi On Call"
    ],
    "Founded Year": [2020, 2015, 2015, 2017, 2024, 2025, 2018, 2016],
    "City Started": [
        "Delhi", "Bangalore", "Bangalore", "Mumbai",
        "Pilot Region", "App-Based", "Backend", "Delhi NCR"
    ],
    "Delivery Time (Days)": [1, 4, 14, 5, 7, 10, 0, 10],
    "Target Audience": [
        "General", "Luxury", "Tech-savvy", "Mid-income",
        "Local Users", "Pan India", "Tailors Only", "High-end"
    ],
    "Status": [
        "Active", "Active", "Active", "Active",
        "Pilot", "New", "Active (B2B)", "Niche"
    ],
    "Has App": [
        "No", "Yes", "Yes", "Yes", "Yes", "Yes (iOS)", "No", "Web only"
    ],
    "Price Range": [
        "Unknown", "High", "High", "High", "Varies", "TBD", "Free", "High"
    ],
    "Scalable Potential (1-10)": [6, 5, 4, 7, 3, 8, 2, 5]
}

startups_df = pd.DataFrame(startup_data)

# Save to CSV
startup_csv_path = "/content/startups_dataset.csv"
startups_df.to_csv(startup_csv_path, index=False)

startup_csv_path

'/content/startups_dataset.csv'

# Task
Build a separate machine learning model in Python using the dataset from "/content/startups_dataset.csv". The model should be appropriate for the data and the task should be clearly defined based on the dataset's columns. The code should be in a Google Colab-compatible format and include data loading, exploration, preprocessing, model training, and evaluation.

## Load the data

### Subtask:
Load the `startups_dataset.csv` file into a pandas DataFrame.


**Reasoning**:
Load the CSV file into a pandas DataFrame.



In [16]:
# Load the CSV file into a pandas DataFrame
startups_df = pd.read_csv("/content/startups_dataset.csv")

# Display the first few rows of the DataFrame to verify loading
display(startups_df.head())

Unnamed: 0,Startup Name,Founded Year,City Started,Delivery Time (Days),Target Audience,Status,Has App,Price Range,Scalable Potential (1-10)
0,Tailor 24,2020,Delhi,1,General,Active,No,Unknown,6
1,YourTailor.in,2015,Bangalore,4,Luxury,Active,Yes,High,5
2,Tech-Tailor,2015,Bangalore,14,Tech-savvy,Active,Yes,High,4
3,TailoreMade,2017,Mumbai,5,Mid-income,Active,Yes,High,7
4,TailorMe,2024,Pilot Region,7,Local Users,Pilot,Yes,Varies,3


## Explore the data

### Subtask:
Display the first few rows and information about the columns to understand the dataset structure and identify potential target variables.


**Reasoning**:
Display the first few rows and information about the columns to understand the dataset structure and identify potential target variables.



In [17]:
# Display the first 5 rows of the DataFrame
display(startups_df.head())

# Display the column names and their data types
display(startups_df.info())

Unnamed: 0,Startup Name,Founded Year,City Started,Delivery Time (Days),Target Audience,Status,Has App,Price Range,Scalable Potential (1-10)
0,Tailor 24,2020,Delhi,1,General,Active,No,Unknown,6
1,YourTailor.in,2015,Bangalore,4,Luxury,Active,Yes,High,5
2,Tech-Tailor,2015,Bangalore,14,Tech-savvy,Active,Yes,High,4
3,TailoreMade,2017,Mumbai,5,Mid-income,Active,Yes,High,7
4,TailorMe,2024,Pilot Region,7,Local Users,Pilot,Yes,Varies,3


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Startup Name               8 non-null      object
 1   Founded Year               8 non-null      int64 
 2   City Started               8 non-null      object
 3   Delivery Time (Days)       8 non-null      int64 
 4   Target Audience            8 non-null      object
 5   Status                     8 non-null      object
 6   Has App                    8 non-null      object
 7   Price Range                8 non-null      object
 8   Scalable Potential (1-10)  8 non-null      int64 
dtypes: int64(3), object(6)
memory usage: 708.0+ bytes


None

## Define the modeling task

### Subtask:
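Based on the available columns, determine what you want to predict (the target variable) and the type of model needed (e.g., classification or regression).

**Reasoning**:
Define the target variable as 'Scalable Potential (1-10)', making this a regression task.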

## Preprocess the data

### Subtask:
Clean and preprocess the data as needed for the chosen modeling task (e.g., handle categorical features, missing values, etc.).


**Reasoning**:
Identify categorical columns, apply one-hot encoding to them, and concatenate with numerical features.



In [18]:
# Identify categorical (object-dtype) columns; the numeric target is not among them
categorical_cols = startups_df.select_dtypes(include='object').columns.tolist()

# Apply one-hot encoding
startups_encoded = pd.get_dummies(startups_df, columns=categorical_cols, drop_first=True)

# Display the first few rows of the encoded DataFrame to verify
display(startups_encoded.head())

Unnamed: 0,Founded Year,Delivery Time (Days),Scalable Potential (1-10),Startup Name_Darzi On Call,Startup Name_Tailor 24,Startup Name_Tailor Smart,Startup Name_TailorMe,Startup Name_TailoreMade,Startup Name_Tech-Tailor,Startup Name_YourTailor.in,...,Status_New,Status_Niche,Status_Pilot,Has App_Web only,Has App_Yes,Has App_Yes (iOS),Price Range_High,Price Range_TBD,Price Range_Unknown,Price Range_Varies
0,2020,1,6,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,2015,4,5,False,False,False,False,False,False,True,...,False,False,False,False,True,False,True,False,False,False
2,2015,14,4,False,False,False,False,False,True,False,...,False,False,False,False,True,False,True,False,False,False
3,2017,5,7,False,False,False,False,True,False,False,...,False,False,False,False,True,False,True,False,False,False
4,2024,7,3,False,False,False,True,False,False,False,...,False,False,True,False,True,False,False,False,False,True
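
One-hot encoding a column whose values are all unique, such as 'Startup Name', adds one column per row without giving the model anything it can generalize from. The sketch below uses a hypothetical four-row frame to compare the encoded width with and without the identifier column; dropping it is a suggestion, not something the notebook does.

```python
import pandas as pd

# Hypothetical miniature of the startups table
toy = pd.DataFrame({
    "Startup Name": ["A", "B", "C", "D"],
    "Has App": ["Yes", "No", "Yes", "Yes"],
    "Delivery Time (Days)": [1, 4, 14, 5],
})

# Encoding the identifier adds a dummy column for almost every row
with_name = pd.get_dummies(toy, columns=["Startup Name", "Has App"], drop_first=True)

# Dropping the identifier first keeps only columns that can repeat across rows
without_name = pd.get_dummies(
    toy.drop(columns=["Startup Name"]), columns=["Has App"], drop_first=True
)

print(with_name.shape[1], list(with_name.columns))
print(without_name.shape[1], list(without_name.columns))
```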


## Split the data

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Define features (X) and target (y) and split the data into training and testing sets.



In [19]:
from sklearn.model_selection import train_test_split

# Define the features (X) by dropping the target variable 'Scalable Potential (1-10)'
X = startups_encoded.drop(columns=['Scalable Potential (1-10)'])

# Define the target variable (y)
y = startups_encoded['Scalable Potential (1-10)']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets to verify the split
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (6, 33)
Shape of X_test: (2, 33)
Shape of y_train: (6,)
Shape of y_test: (2,)


## Train a model

### Subtask:
Select and train an appropriate machine learning model for your chosen task.


**Reasoning**:
Select and train a RandomForestRegressor model on the training data, since the target is numeric (a regression task) and a random forest handles a mix of numeric and one-hot encoded features without feature scaling.



In [20]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate a RandomForestRegressor model
model = RandomForestRegressor(random_state=42)

# Train the model using the training data
model.fit(X_train, y_train)


## Evaluate the model

### Subtask:
Evaluate the model's performance using regression metrics such as Mean Squared Error and R-squared.


**Reasoning**:
Evaluate the trained model using regression metrics.



In [22]:
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model using Mean Squared Error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 7.433299999999999
R-squared: -2.3036888888888885
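
A negative R-squared can look like a bug, so it helps to compute it by hand. R-squared is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below uses two hypothetical test targets and predictions (not the notebook's actual values) to show how a two-sample test set produces a strongly negative score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical two-sample test set and predictions
y_true_toy = np.array([5.0, 8.0])
y_pred_toy = np.array([6.0, 4.0])

# Residual sum of squares: (5 - 6)^2 + (8 - 4)^2 = 17
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)

# Total sum of squares around the mean 6.5: 2.25 + 2.25 = 4.5
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
mse_toy = mean_squared_error(y_true_toy, y_pred_toy)

# Always predicting the mean of the true values gives R-squared of exactly 0
mean_only = np.full_like(y_true_toy, y_true_toy.mean())
r2_mean_only = r2_score(y_true_toy, mean_only)

print(mse_toy, r2_manual, r2_mean_only)
```

Any R-squared below 0 therefore means the model did worse than a constant prediction at the test-set mean.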


## Analyze the model (optional)

### Subtask:
Analyze feature importance or other relevant aspects of the trained model.


**Reasoning**:
Access the feature_importances_ attribute of the trained RandomForestRegressor model, create a pandas Series from the feature importances, sort it, and print the result to determine the importance of each feature in predicting the target variable.



In [23]:
import pandas as pd

# Access the feature_importances_ attribute of the trained model
feature_importances = model.feature_importances_

# Create a pandas Series from the feature importances, using the column names of X_train as the index
feature_importance_series = pd.Series(feature_importances, index=X_train.columns)

# Sort the feature importance Series in descending order
sorted_feature_importances = feature_importance_series.sort_values(ascending=False)

# Print the sorted feature importances
print("Feature Importances (Sorted):")
print(sorted_feature_importances)

Feature Importances (Sorted):
Delivery Time (Days)            0.211740
Founded Year                    0.108956
City Started_Backend            0.073702
City Started_Mumbai             0.072095
Price Range_High                0.069949
Target Audience_Tailors Only    0.061277
Startup Name_TailoreMade        0.049847
Status_Active (B2B)             0.046496
Startup Name_Tailor 24          0.041259
Price Range_Unknown             0.038696
Has App_Yes                     0.037588
City Started_Delhi              0.033508
Has App_Web only                0.021829
Target Audience_Local Users     0.021371
Target Audience_Mid-income      0.019465
Startup Name_Tech-Tailor        0.019311
City Started_Pilot Region       0.018248
Startup Name_TailorMe           0.013039
Price Range_Varies              0.008027
Target Audience_High-end        0.007686
City Started_Bangalore          0.006403
City Started_Delhi NCR          0.005788
Status_Pilot                    0.005029
Startup Name_Darzi On Call 

## Summary:

### Data Analysis Key Findings

*   The dataset contains 8 entries and 9 columns, with no missing values.
*   The target variable chosen for prediction was 'Scalable Potential (1-10)', making the task a regression problem.
*   Categorical features were successfully one-hot encoded, resulting in 33 feature columns.
*   The data was split into training (6 samples) and testing (2 samples) sets.
*   A RandomForestRegressor model was trained.
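*   The model achieved a Mean Squared Error (MSE) of 7.43 and an R-squared ($R^2$) of -2.30 on the test set. A negative $R^2$ means the predictions were worse than always predicting the mean of the two test targets.
*   Feature importance analysis ranked 'Delivery Time (Days)' (0.21) and 'Founded Year' (0.11) highest, but with only 6 training rows and one-hot columns for each unique startup name, these rankings are not reliable.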

### Insights or Next Steps

*   The small dataset size (only 8 samples) is likely the primary reason for the poor model performance and negative R-squared. A significantly larger dataset is needed for meaningful model training and evaluation.
*   Given the limited data, exploring simpler models or alternative approaches like domain expertise-based rules might be more appropriate than complex machine learning models.
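
With so few rows, one way to compare a model against a simple alternative is leave-one-out cross-validation with a mean-predicting baseline. The following is a sketch on randomly generated stand-in data (8 rows, 2 features, integer targets from 1 to 10); none of it comes from the startups table.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Random stand-in data with the same number of rows as the startups table
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(8, 2))
y_toy = rng.integers(1, 11, size=8).astype(float)

# Each row is predicted by a model trained on the other seven rows
loo = LeaveOneOut()
baseline_pred = cross_val_predict(DummyRegressor(strategy="mean"), X_toy, y_toy, cv=loo)
forest_pred = cross_val_predict(
    RandomForestRegressor(n_estimators=50, random_state=42), X_toy, y_toy, cv=loo
)

baseline_mae = mean_absolute_error(y_toy, baseline_pred)
forest_mae = mean_absolute_error(y_toy, forest_pred)
print(baseline_mae, forest_mae)
```

If the forest cannot beat the mean baseline under this scheme, that supports the conclusion that the dataset is too small for the model to learn from.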


## Define the modeling task

### Subtask:
Based on the available columns, determine what you want to predict (the target variable) and the type of model needed (e.g., classification or regression).

**Reasoning**:
Define the target variable as 'Scalable Potential (1-10)', making this a regression task.

## Preprocess the data

### Subtask:
Clean and preprocess the data as needed for the chosen modeling task (e.g., handle categorical features, missing values, etc.).

**Reasoning**:
Identify categorical columns, apply one-hot encoding to them, and concatenate with numerical features.

In [26]:
# Identify categorical columns (excluding the target variable)
categorical_cols = startups_df.select_dtypes(include='object').columns.tolist()

# Apply one-hot encoding
startups_encoded = pd.get_dummies(startups_df, columns=categorical_cols, drop_first=True)

# Display the first few rows of the encoded DataFrame to verify
display(startups_encoded.head())

Unnamed: 0,Founded Year,Delivery Time (Days),Scalable Potential (1-10),Startup Name_Darzi On Call,Startup Name_Tailor 24,Startup Name_Tailor Smart,Startup Name_TailorMe,Startup Name_TailoreMade,Startup Name_Tech-Tailor,Startup Name_YourTailor.in,...,Status_New,Status_Niche,Status_Pilot,Has App_Web only,Has App_Yes,Has App_Yes (iOS),Price Range_High,Price Range_TBD,Price Range_Unknown,Price Range_Varies
0,2020,1,6,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,2015,4,5,False,False,False,False,False,False,True,...,False,False,False,False,True,False,True,False,False,False
2,2015,14,4,False,False,False,False,False,True,False,...,False,False,False,False,True,False,True,False,False,False
3,2017,5,7,False,False,False,False,True,False,False,...,False,False,False,False,True,False,True,False,False,False
4,2024,7,3,False,False,False,True,False,False,False,...,False,False,True,False,True,False,False,False,False,True


## Split the data

### Subtask:
Split the data into training and testing sets.

**Reasoning**:
Define features (X) and target (y) and split the data into training and testing sets.

In [27]:
from sklearn.model_selection import train_test_split

# Define the features (X) by dropping the target variable 'Scalable Potential (1-10)'
X = startups_encoded.drop(columns=['Scalable Potential (1-10)'])

# Define the target variable (y)
y = startups_encoded['Scalable Potential (1-10)']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets to verify the split
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (6, 33)
Shape of X_test: (2, 33)
Shape of y_train: (6,)
Shape of y_test: (2,)


## Train a model

### Subtask:
Select and train an appropriate machine learning model for your chosen task.

**Reasoning**:
Select and train a RandomForestRegressor model on the training data since the task is regression and RandomForest works well with various data types.

In [28]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate a RandomForestRegressor model
model = RandomForestRegressor(random_state=42)

# Train the model using the training data
model.fit(X_train, y_train)


## Evaluate the model

### Subtask:
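Evaluate the model's performance on the test set using regression metrics: Mean Squared Error (MSE) and R-squared.

**Reasoning**:
Accuracy and classification reports apply to classifiers. The target here is a numeric score, so evaluate the trained model with regression metrics instead.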

In [30]:
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model using Mean Squared Error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 7.433299999999999
R-squared: -2.3036888888888885


## Analyze the model (optional)

### Subtask:
Analyze feature importance or other relevant aspects of the trained model.

**Reasoning**:
Access the feature_importances_ attribute of the trained RandomForestRegressor model, create a pandas Series from the feature importances, sort it, and print the result to determine the importance of each feature in predicting the target variable.

In [31]:
import pandas as pd

# Access the feature_importances_ attribute of the trained model
feature_importances = model.feature_importances_

# Create a pandas Series from the feature importances, using the column names of X_train as the index
feature_importance_series = pd.Series(feature_importances, index=X_train.columns)

# Sort the feature importance Series in descending order
sorted_feature_importances = feature_importance_series.sort_values(ascending=False)

# Print the sorted feature importances
print("Feature Importances (Sorted):")
print(sorted_feature_importances)

Feature Importances (Sorted):
Delivery Time (Days)            0.211740
Founded Year                    0.108956
City Started_Backend            0.073702
City Started_Mumbai             0.072095
Price Range_High                0.069949
Target Audience_Tailors Only    0.061277
Startup Name_TailoreMade        0.049847
Status_Active (B2B)             0.046496
Startup Name_Tailor 24          0.041259
Price Range_Unknown             0.038696
Has App_Yes                     0.037588
City Started_Delhi              0.033508
Has App_Web only                0.021829
Target Audience_Local Users     0.021371
Target Audience_Mid-income      0.019465
Startup Name_Tech-Tailor        0.019311
City Started_Pilot Region       0.018248
Startup Name_TailorMe           0.013039
Price Range_Varies              0.008027
Target Audience_High-end        0.007686
City Started_Bangalore          0.006403
City Started_Delhi NCR          0.005788
Status_Pilot                    0.005029
Startup Name_Darzi On Call 
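
One-hot encoding spreads each categorical column across several dummy features, so the ranking above understates columns such as 'City Started'. A minimal sketch of summing importances back to their source column, assuming the dummy names use `_` as the separator and the original column names contain no underscore. Only a few values copied from the printed output are used here (the printed list is truncated), so the totals are partial; the helper `group_by_source_column` is hypothetical:

```python
from collections import defaultdict

# A few importances copied from the printed output (not the full list)
printed_importances = {
    "Delivery Time (Days)": 0.211740,
    "Founded Year": 0.108956,
    "City Started_Backend": 0.073702,
    "City Started_Mumbai": 0.072095,
    "Price Range_High": 0.069949,
    "Price Range_Unknown": 0.038696,
}

def group_by_source_column(importances, sep="_"):
    """Sum dummy-feature importances under the text before the first separator."""
    grouped = defaultdict(float)
    for name, value in importances.items():
        source = name.split(sep, 1)[0]  # names without a separator stay unchanged
        grouped[source] += value
    # Sort from most to least important
    return dict(sorted(grouped.items(), key=lambda kv: kv[1], reverse=True))

grouped = group_by_source_column(printed_importances)
for column, total in grouped.items():
    print(f"{column:<25}{total:.6f}")
```

In the notebook itself the same grouping could be applied to `sorted_feature_importances.to_dict()` to cover every feature.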

## Summary:

### Data Analysis Key Findings

* The dataset contains 8 entries and 8 columns, with no missing values.
* The target variable chosen for prediction was 'Scalable Potential (1-10)', making the task a regression problem.
* Categorical features were successfully one-hot encoded, resulting in 33 feature columns.
* The data was split into training (6 samples) and testing (2 samples) sets.
* A RandomForestRegressor model was trained.
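* The model achieved a Mean Squared Error (MSE) of 7.43 and an R-squared ($R^2$) of -2.30 on the test set. A negative $R^2$ means the predictions were worse than always predicting the mean of the test targets, so the model shows no predictive value.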
* Feature importance analysis ranked 'Delivery Time (Days)' (0.21) and 'Founded Year' (0.11) highest, but importances estimated from 6 training rows by a model with negative $R^2$ should not be read as real effects.

### Insights or Next Steps

* The small dataset size (only 8 samples, of which just 2 were held out for testing) is likely the primary reason for the poor performance and negative R-squared; a score computed on two points is also too noisy to trust. A significantly larger dataset is needed for meaningful model training and evaluation.
* Given the limited data, exploring simpler models or alternative approaches like domain expertise-based rules might be more appropriate than complex machine learning models.
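
One way to act on both points is to score the simplest possible model, a predict-the-mean baseline, with leave-one-out evaluation so that every row serves as a test case once. The sketch below is pure Python and uses hypothetical placeholder scores, not the notebook's actual target values; any real model would need to beat this baseline to be worth keeping.

```python
from statistics import mean

def leave_one_out_mae(targets):
    """Mean absolute error of a mean-of-training-rows baseline under leave-one-out."""
    errors = []
    for i, held_out in enumerate(targets):
        # Train on every row except row i
        train = targets[:i] + targets[i + 1:]
        prediction = mean(train)
        errors.append(abs(held_out - prediction))
    return mean(errors)

# Hypothetical placeholder scores on a 1-10 scale (NOT the real dataset)
example_targets = [5, 8, 6, 7, 4, 9, 6, 7]

baseline_mae = leave_one_out_mae(example_targets)
print(f"Leave-one-out MAE of the mean baseline: {baseline_mae:.2f}")
```

In the notebook, `example_targets` would be replaced by `y.tolist()`.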

## Save the model

### Subtask:
Save the trained model to a file.

**Reasoning**:
Save the trained model to a file using joblib.

In [34]:
import joblib

# Define the filename for the saved model
model_filename_startup = 'startup_model.joblib'

# Save the trained model to the file
joblib.dump(model, model_filename_startup)

print(f"Model saved to {model_filename_startup}")

Model saved to startup_model.joblib


## Load the model and check predictions

### Subtask:
Load the saved model and use it to make predictions to ensure it's working correctly.

**Reasoning**:
Load the saved model using `joblib.load` and then use the loaded model to make predictions on the `X_test` data to verify it's working.

In [35]:
import joblib

# Define the filename of the saved model
model_filename_startup = 'startup_model.joblib'

# Load the trained model from the file
loaded_startup_model = joblib.load(model_filename_startup)

# Use the loaded model to make predictions on the test set
loaded_startup_model_predictions = loaded_startup_model.predict(X_test)

# Display the predictions from the loaded model
print("Predictions from the loaded startup model on the test set:")
print(loaded_startup_model_predictions)

# You can compare these predictions to the actual values in y_test
print("\nActual values from y_test:")
print(y_test.values)

Predictions from the loaded startup model on the test set:
[5.21 4.15]

Actual values from y_test:
[5 8]
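
As a sanity check, the two predictions and actual values printed above are enough to recompute the reported metrics by hand, using $\text{MSE} = \frac{1}{n}\sum (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$. The sketch below assumes the printed predictions are exact rather than rounded for display:

```python
# Values copied from the output above
y_true = [5, 8]
y_pred = [5.21, 4.15]

n = len(y_true)

# Residual sum of squares: squared gaps between actual and predicted values
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# Total sum of squares: squared gaps between actual values and their mean
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n
r2 = 1 - ss_res / ss_tot

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared: {r2:.4f}")
```

Nearly all of the error comes from the second sample (predicted 4.15, actual 8), which shows how a single miss dominates the score when the test set has two rows.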
