# **Model Building on a Synthetic Dataset**

## **Assignment**
The two synthetic datasets were generated using the same underlying data model. Your goal is to build a predictive model using the data in the training dataset to predict the withheld target values from the test set.

You may use any tools available to you for this task. Ultimately, we will assess predictive accuracy on the test set using the mean squared error metric. You should produce the following:

- A 1,000 x 1 text file containing 1 prediction per line for each record in the test dataset.
- A brief writeup describing the techniques you used to generate the predictions. Details such as important features and your estimates of predictive performance are helpful here, though not strictly necessary.
- (Optional) An implementable version of your model. What this would look like largely depends on the methods you used, but could include things like source code, a pickled Python object, a PMML file, etc. Please do not include any compiled executables.



## **Data Description**
We have provided two tab-delimited files along with these instructions:

- codetest_train.txt: 5,000 records x 254 features + 1 target

- codetest_test.txt: 1,000 records x 254 features



### **Practicalities**
The purpose of this test is to assess your ability to write software to collect, normalize, store, analyze and visualize “real world” data. You may use any tools or software on your computer, or that are freely available on the Internet. We prefer simpler tools to more complex ones, and we prefer that you are “lazy” in the sense of using third-party APIs and libraries as much as possible. We encourage the reuse of code when appropriate. If your submission directly includes code written by someone else, other than commonly imported modules, please provide proper attribution in the code comments, including a URL, text, author, or other available information.

Do as much as you can, as well as you can. Prefer efficient, elegant solutions. Prefer scripted analysis to unrepeatable use of GUI tools. For data security and transfer time reasons, you have been given a relatively small data file. Prefer solutions that do not require the full data set to be stored in memory.

There is certainly no requirement that you have previous experience working on these kinds of problems. Rather, we are looking for an ability to research and select the appropriate tools for an open-ended problem and implement something meaningful. We are also interested in your ability to work on a team, which means considering how to package and deliver your results in a way that makes it easy for others to review them. Undocumented code and data dumps are virtually useless; commented code and a clear writeup with elegant visuals are ideal.

#### To download the dataset <a href="https://drive.google.com/drive/folders/1fTWdsoPQCXaqBwrwjZQkzb8t-aA0ijTJ?usp=sharing"> Click here </a>

In [22]:
import pandas as pd

# Load the training data
train_file_path = "C:/Users/manoj/Downloads/codetest_train.txt"
train_data = pd.read_csv(train_file_path, delimiter='\t')

# Load the test data
test_file_path = "C:/Users/manoj/Downloads/codetest_test.txt"
test_data = pd.read_csv(test_file_path, delimiter='\t')

# Check the shape of the datasets
print(train_data.shape)  # Should be (5000, 255)
print(test_data.shape)   # Should be (1000, 254)


(5000, 255)
(1000, 254)
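
The assignment prefers solutions that do not require the full data set to be stored in memory, while the cell above reads each file in one go. As a minimal sketch of the streaming alternative, the snippet below reads a tab-delimited source in chunks with `pd.read_csv(..., chunksize=...)` and accumulates per-column means. The helper name `streaming_column_means`, the chunk size, and the tiny in-memory sample are illustrative assumptions and are not part of the original notebook.

```python
import io

import pandas as pd


def streaming_column_means(source, chunksize=1000):
    """Compute per-column means of a tab-delimited source one chunk at a time."""
    total = None   # running sum of non-missing values per column
    count = None   # running count of non-missing values per column
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        numeric = chunk.select_dtypes(include="number")
        chunk_sum = numeric.sum()      # NaN values are skipped by default
        chunk_count = numeric.count()  # counts only non-missing values
        total = chunk_sum if total is None else total.add(chunk_sum, fill_value=0)
        count = chunk_count if count is None else count.add(chunk_count, fill_value=0)
    return total / count


# Hypothetical three-row sample standing in for codetest_train.txt
demo = io.StringIO("target\tf_0\tf_1\n1.0\t2.0\t\n3.0\t4.0\t6.0\n5.0\t\t8.0\n")
means = streaming_column_means(demo, chunksize=2)
print(means)
```

The same running sums could feed a mean imputation step without ever materializing all rows at once; for a 5,000-row file the difference is negligible, so this matters mainly if the full data set is much larger.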


In [23]:
# Display the first few rows of the training data
print(train_data.head())

# Check for missing values
print(train_data.isnull().sum().sum())
print(test_data.isnull().sum().sum())

# Check which columns have a numeric dtype (True) versus a non-numeric one (False)
print(train_data.dtypes.apply(pd.api.types.is_numeric_dtype))


     target    f_0    f_1    f_2    f_3    f_4    f_5    f_6    f_7    f_8  \
0  3.066056 -0.653  0.255 -0.615 -1.833 -0.736    NaN  1.115 -0.171 -0.351   
1 -1.910473  1.179 -0.093 -0.556  0.811 -0.468 -0.005 -0.116 -1.243  1.985   
2  7.830711  0.181 -0.778 -0.919  0.113  0.887 -0.762  1.872 -1.709  0.135   
3 -2.180862  0.745 -0.245 -1.343  1.163 -0.169 -0.151 -1.100  0.225  1.223   
4  5.462784  1.217 -1.324 -0.958  0.448 -2.873 -0.856  0.603  0.763  0.020   

   ...  f_244  f_245  f_246  f_247  f_248  f_249  f_250  f_251  f_252  f_253  
0  ... -1.607 -1.400 -0.920 -0.198 -0.945 -0.573  0.170 -0.418 -1.244 -0.503  
1  ...  1.282  0.032 -0.061    NaN -0.061 -0.302  1.281 -0.850  0.821 -0.260  
2  ... -0.237 -0.660  1.073 -0.193  0.570 -0.267  1.435  1.332 -1.147  2.580  
3  ...  0.709 -0.203 -0.136 -0.571  1.682  0.243 -0.381  0.613  1.033  0.400  
4  ...  0.892 -0.433 -0.877  0.289  0.654  1.230  0.457 -0.754 -0.025 -0.931  

[5 rows x 255 columns]
25207
4973



target    True
f_0       True
f_1       True
f_2       True
f_3       True
          ... 
f_249     True
f_250     True
f_251     True
f_252     True
f_253     True
Length: 255, dtype: bool
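
The totals printed above (25207 and 4973 missing values) do not show which columns are affected, and the truncated boolean listing hides most of the columns. A small per-column summary makes both visible. This is a sketch on a made-up two-column frame; the helper name `summarize_columns` and the sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Return dtype, missing count, and missing fraction for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "frac_missing": df.isnull().mean(),
    })


# Hypothetical sample: one numeric column and one string column
demo = pd.DataFrame({
    "f_0": pd.Series([1.0, np.nan, 3.0, 4.0]),
    "f_1": pd.Series(["a", "b", None, "a"], dtype=object),
})
summary = summarize_columns(demo)
non_numeric_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(summary.sort_values("frac_missing", ascending=False))
print("Non-numeric columns:", non_numeric_cols)
```

Applied to `train_data`, the `non_numeric_cols` list would name the columns that need special handling before modeling.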


In [26]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Separate features and target variable in training data
X_train = train_data.drop(columns=['target'])
y_train = train_data['target']

# Coerce any non-numeric (object-dtype) columns to numeric.
# Values that cannot be parsed become NaN, so a column made up entirely of
# strings becomes all-NaN and is then discarded by SimpleImputer below.
for col in X_train.columns:
    if X_train[col].dtype == 'object':
        X_train[col] = pd.to_numeric(X_train[col], errors='coerce')
        test_data[col] = pd.to_numeric(test_data[col], errors='coerce')

# Fill missing values with SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(test_data)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)
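
Coercing string columns with `errors='coerce'` throws away whatever information they carry. If the non-numeric columns are categorical (an assumption; the notebook does not confirm it), one alternative is to one-hot encode them alongside the numeric preprocessing. The sketch below uses scikit-learn's `ColumnTransformer` on a made-up frame; the column values and step names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: mean imputation followed by standardization
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Non-numeric columns: most-frequent imputation followed by one-hot encoding
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, make_column_selector(dtype_include=np.number)),
        ("cat", categorical_steps, make_column_selector(dtype_exclude=np.number)),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Hypothetical sample with one numeric and one categorical column
demo = pd.DataFrame({
    "f_0": pd.Series([1.0, np.nan, 3.0, 5.0]),
    "f_1": pd.Series(["red", "blue", np.nan, "red"], dtype=object),
})
transformed = preprocess.fit_transform(demo)
print(transformed.shape)
```

Whether this improves the error on the real data is untested here; it would need to be checked with the same cross-validation used for the other models.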


In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42)
}

# Evaluate models using cross-validation
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
    print(f"{name} - Mean Squared Error: {-scores.mean():.4f}")



Linear Regression - Mean Squared Error: 14.8377
Random Forest - Mean Squared Error: 13.3959
Gradient Boosting - Mean Squared Error: 9.2339
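
The scores above come from features that were imputed and scaled on the full training set before the folds were split, so each validation fold has influenced the column means and standard deviations. Wrapping the preprocessing and the model in one pipeline lets `cross_val_score` refit every step inside each fold. The sketch below shows the pattern on synthetic data; the data, the fold settings, and `n_estimators=50` are assumptions chosen to keep it quick, not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject missing values

# Imputation and scaling are refit on the training part of every fold
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=50, random_state=42),
)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    pipeline, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)
cv_mse = -scores  # scikit-learn reports negated MSE
print(f"CV MSE: {cv_mse.mean():.4f} +/- {cv_mse.std():.4f}")
```

Reporting the spread across folds alongside the mean also gives a rough sense of how stable the comparison between models is.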


In [28]:
# Gradient Boosting had the lowest cross-validated MSE above, so refit it on the full training set
best_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
best_model.fit(X_train_scaled, y_train)
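
Once a gradient boosting model has been fit, its `feature_importances_` attribute gives an impurity-based ranking of the inputs, which is one way to report the important features the assignment asks about. The sketch below demonstrates this on synthetic data where only one feature drives the target; the data and feature names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends only on the feature named f_2
rng = np.random.default_rng(0)
feature_names = [f"f_{i}" for i in range(6)]
X_demo = rng.normal(size=(300, 6))
y_demo = 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=300)

model = GradientBoostingRegressor(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.head())
```

In the notebook the names would come from `X_train.columns`, which only lines up with `best_model.feature_importances_` if the imputer did not discard any columns.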


In [29]:
predictions = best_model.predict(X_test_scaled)

In [30]:
import numpy as np

# Save predictions to a text file
output_file_path = "C:/Users/manoj/Downloads/predictions.txt"
np.savetxt(output_file_path, predictions, fmt='%.6f')
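
The deliverable is a 1,000 x 1 text file with one prediction per line, so it is worth checking the shape and contents before and after writing. The following is a minimal sketch of such a check; the helper name `save_predictions`, the temporary directory, and the placeholder predictions are assumptions for illustration.

```python
import os
import tempfile

import numpy as np


def save_predictions(path, predictions, expected_rows):
    """Validate predictions, then write one value per line."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    if predictions.shape[0] != expected_rows:
        raise ValueError(
            f"expected {expected_rows} predictions, got {predictions.shape[0]}"
        )
    if not np.isfinite(predictions).all():
        raise ValueError("predictions contain NaN or infinite values")
    np.savetxt(path, predictions, fmt="%.6f")


# Placeholder values standing in for best_model.predict(X_test_scaled)
demo_predictions = np.linspace(-1.0, 1.0, 1000)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "predictions.txt")
    save_predictions(demo_path, demo_predictions, expected_rows=1000)
    reloaded = np.loadtxt(demo_path)  # read back to confirm the file layout

print(reloaded.shape)
```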


In [31]:
# Report
#
# Techniques used
# - Data preprocessing: coerced non-numeric values to numeric, filled missing
#   values with column means, and standardized the features.
# - Model selection: compared Linear Regression, Random Forest, and Gradient
#   Boosting with 5-fold cross-validation; selected Gradient Boosting.
# - Evaluation metric: mean squared error (MSE).
#
# Important features
# - Feature importances can be extracted from the fitted Gradient Boosting
#   model; this step was skipped for now.
#
# Predictive performance
# - Gradient Boosting achieved the lowest cross-validated MSE (9.2339), compared
#   with 13.3959 for Random Forest and 14.8377 for Linear Regression, which
#   suggests it generalizes better than the other two candidates.
#
# Future work
# - Tune hyperparameters further.
# - Experiment with additional models and ensemble methods.
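
Hyperparameter tuning, listed above as future work, could be approached with a cross-validated grid search. The sketch below runs `GridSearchCV` over a deliberately small grid on synthetic data; the grid values, the 3-fold split, and the data are assumptions picked to keep the example fast, not recommended settings for the real data set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=150)

# Small illustrative grid: 2 x 2 x 2 = 8 candidate settings
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_mse = -search.best_score_  # scikit-learn reports negated MSE
print("Best parameters:", search.best_params_)
print(f"Best CV MSE: {best_mse:.4f}")
```

After fitting, `search.best_estimator_` is already refit on all of the data passed to `fit`, so it can be used directly for prediction.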

In [32]:
import pickle

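# Save the fitted model to a file.
# Note: the model expects inputs already transformed by the fitted `imputer`
# and `scaler`, which are not saved here.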
model_file_path = "C:/Users/manoj/Downloads/best_model.pkl"
with open(model_file_path, 'wb') as f:
    pickle.dump(best_model, f)
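
A pickled model is easier for others to use if the preprocessing travels with it. One way to do that, sketched below under the assumption that mean imputation and standardization are the only preprocessing steps, is to fit a single scikit-learn pipeline and pickle the whole object. The synthetic data, the file name, and the temporary directory are illustrative only.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the raw (unscaled, partly missing) training data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)
X_demo[0, 1] = np.nan

# One object holding the imputer, the scaler, and the model
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=20, random_state=42),
)
pipeline.fit(X_demo, y_demo)

X_new = rng.normal(size=(5, 3))  # raw inputs, no manual preprocessing needed
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "pipeline.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)

original_preds = pipeline.predict(X_new)
restored_preds = restored.predict(X_new)
print(np.allclose(original_preds, restored_preds))
```

Pickles are tied to the library versions that produced them, so recording the scikit-learn version next to the file helps whoever loads it later.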
