# **Model Building on a Synthetic Dataset**

## **Assignment**
The two synthetic datasets were generated using the same underlying data model. Your goal is to build a predictive model using the data in the training dataset to predict the withheld target values from the test set.

You may use any tools available to you for this task. Ultimately, we will assess predictive accuracy on the test set using the mean squared error metric. You should produce the following:

- A 1,000 x 1 text file containing 1 prediction per line for each record in the test dataset.
- A brief writeup describing the techniques you used to generate the predictions. Details such as important features and your estimates of predictive performance are helpful here, though not strictly necessary.
- (Optional) An implementable version of your model. What this would look like largely depends on the methods you used, but could include things like source code, a pickled Python object, a PMML file, etc. Please do not include any compiled executables.
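
Since accuracy will be judged by mean squared error, it may help to see the metric spelled out. The following is a minimal sketch; the function name `mean_squared_error_manual` and the toy values are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean((y_true - y_pred) ** 2))

# Toy values, not taken from the datasets: squared errors are 0, 0 and 4
example_mse = mean_squared_error_manual([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example_mse)
```

Because errors are squared, a few large misses can dominate the score, which is worth keeping in mind when comparing models.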



## **Data Description**
We have provided two tab-delimited files along with these instructions:

- codetest_train.txt: 5,000 records x 254 features + 1 target

- codetest_test.txt : 1,000 records x 254 features



### **Practicalities**
The purpose of this test is to assess your ability to write software to collect, normalize, store, analyze and visualize “real world” data. You may use any tools or software on your computer, or that are freely available on the Internet. We prefer simpler tools to more complex ones, and we prefer that you are “lazy” in the sense of using third-party APIs and libraries as much as possible. We encourage the reuse of code when appropriate. If your submission directly includes code written by someone else, other than commonly imported modules, please provide proper attribution in the code comments, including a URL, text, author, or other available information.

Do as much as you can, as well as you can. Prefer efficient, elegant solutions. Prefer scripted analysis to unrepeatable use of GUI tools. For data security and transfer time reasons, you have been given a relatively small data file. Prefer solutions that do not require the full data set to be stored in memory.
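
One way to avoid holding the full data set in memory is to stream the file in chunks and accumulate statistics as you go. The sketch below assumes a small in-memory stand-in (`toy_file`, with made-up values) in place of `codetest_train.txt`, and computes per-column means incrementally with the `chunksize` option of `pd.read_csv`.

```python
import io
import pandas as pd

# Stand-in for a large tab-delimited file; the values here are made up.
toy_file = io.StringIO(
    "target\tf_0\tf_1\n"
    "1.0\t0.5\t\n"
    "2.0\t\t1.5\n"
    "3.0\t1.5\t2.5\n"
    "4.0\t2.5\t3.5\n"
)

running_sum = None
running_count = None

# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(toy_file, delimiter="\t", chunksize=2):
    numeric = chunk.select_dtypes(include="number")
    chunk_sum = numeric.sum()      # NaN values are skipped by default
    chunk_count = numeric.count()  # counts non-missing values only
    if running_sum is None:
        running_sum, running_count = chunk_sum, chunk_count
    else:
        running_sum = running_sum.add(chunk_sum, fill_value=0)
        running_count = running_count.add(chunk_count, fill_value=0)

# Means computed without ever loading the whole file at once
column_means = running_sum / running_count
print(column_means)
```

The same pattern extends to other statistics that can be updated incrementally, such as counts of missing values.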

There is certainly no requirement that you have previous experience working on these kinds of problems. Rather, we are looking for an ability to research and select the appropriate tools for an open-ended problem and implement something meaningful. We are also interested in your ability to work on a team, which means considering how to package and deliver your results in a way that makes it easy for others to review them. Undocumented code and data dumps are virtually useless; commented code and a clear writeup with elegant visuals are ideal.

#### To download the dataset, <a href="https://drive.google.com/drive/folders/1fTWdsoPQCXaqBwrwjZQkzb8t-aA0ijTJ?usp=sharing">click here</a>

In [1]:
import pandas as pd


train_data = pd.read_csv('codetest_train.txt', delimiter='\t')


test_data = pd.read_csv('codetest_test.txt', delimiter='\t')


train_data.head()


Unnamed: 0,target,f_0,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,...,f_244,f_245,f_246,f_247,f_248,f_249,f_250,f_251,f_252,f_253
0,3.066056,-0.653,0.255,-0.615,-1.833,-0.736,,1.115,-0.171,-0.351,...,-1.607,-1.4,-0.92,-0.198,-0.945,-0.573,0.17,-0.418,-1.244,-0.503
1,-1.910473,1.179,-0.093,-0.556,0.811,-0.468,-0.005,-0.116,-1.243,1.985,...,1.282,0.032,-0.061,,-0.061,-0.302,1.281,-0.85,0.821,-0.26
2,7.830711,0.181,-0.778,-0.919,0.113,0.887,-0.762,1.872,-1.709,0.135,...,-0.237,-0.66,1.073,-0.193,0.57,-0.267,1.435,1.332,-1.147,2.58
3,-2.180862,0.745,-0.245,-1.343,1.163,-0.169,-0.151,-1.1,0.225,1.223,...,0.709,-0.203,-0.136,-0.571,1.682,0.243,-0.381,0.613,1.033,0.4
4,5.462784,1.217,-1.324,-0.958,0.448,-2.873,-0.856,0.603,0.763,0.02,...,0.892,-0.433,-0.877,0.289,0.654,1.23,0.457,-0.754,-0.025,-0.931


In [2]:
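# Summarize the training data: column count, dtypes, memory usage (info() prints directly)
train_data.info()

# Summary statistics for the training data
# (not rendered: a notebook shows only the last expression in a cell)
train_data.describe()

# First few rows of the test data (not rendered, for the same reason)
test_data.head()

# Summarize the test data (info() prints directly)
test_data.info()

# Summary statistics for the test data; as the last expression, this is the table shown below
test_data.describe()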

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 255 entries, target to f_253
dtypes: float64(251), object(4)
memory usage: 9.7+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 254 entries, f_0 to f_253
dtypes: float64(250), object(4)
memory usage: 1.9+ MB


Unnamed: 0,f_0,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,...,f_244,f_245,f_246,f_247,f_248,f_249,f_250,f_251,f_252,f_253
count,972.0,983.0,983.0,982.0,978.0,984.0,982.0,978.0,980.0,979.0,...,973.0,976.0,985.0,979.0,986.0,980.0,981.0,978.0,979.0,985.0
mean,-0.035218,0.100792,0.021098,0.005391,-0.042323,-0.013639,-0.033501,0.025939,0.029102,0.000221,...,0.022764,-0.071009,0.038268,0.053555,-0.021745,-0.028499,-0.025662,0.003518,-0.007108,-0.017024
std,0.978034,0.989471,1.028476,0.989312,1.016019,1.038796,1.018101,0.971878,0.962718,0.999688,...,0.973855,1.01004,0.981809,1.026265,1.028849,0.941963,1.045912,1.001211,0.99815,1.015624
min,-2.997,-2.76,-3.517,-3.017,-3.332,-3.241,-3.441,-2.828,-2.911,-4.457,...,-2.801,-4.186,-3.156,-4.145,-3.4,-2.86,-3.364,-2.959,-2.985,-3.491
25%,-0.679,-0.575,-0.654,-0.67175,-0.706,-0.73,-0.752,-0.666,-0.569,-0.715,...,-0.586,-0.75275,-0.611,-0.6885,-0.7625,-0.6585,-0.73,-0.67125,-0.6845,-0.722
50%,-0.0455,0.149,0.033,0.027,-0.0305,-0.0245,-0.0825,0.0175,0.0075,-0.035,...,0.02,-0.099,0.05,0.041,-0.013,-0.035,-0.051,0.0305,0.01,-0.009
75%,0.6715,0.715,0.697,0.66175,0.594,0.7305,0.659,0.6815,0.7165,0.679,...,0.653,0.58675,0.691,0.726,0.649,0.587,0.717,0.67025,0.6795,0.655
max,3.022,3.188,3.122,2.927,3.462,2.873,3.73,3.176,2.863,3.023,...,3.041,3.496,3.019,3.971,3.047,3.351,3.336,3.083,2.727,3.413
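
The `count` row above falls short of 1,000 in every column, which indicates scattered missing values. A quick way to rank columns by how much is missing is sketched below on a toy frame (the frame `toy` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for test_data (values are made up)
toy = pd.DataFrame({
    "f_0": [0.1, np.nan, 0.3, 0.4],
    "f_1": [np.nan, np.nan, 1.0, 2.0],
    "f_2": [1.0, 2.0, 3.0, 4.0],
})

# isna() gives a boolean frame; its column means are the missing fractions
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```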


In [4]:
# Identify the non-float columns in the training and test datasets
non_numeric_columns_train = train_data.select_dtypes(exclude=['float64']).columns
non_numeric_columns_test = test_data.select_dtypes(exclude=['float64']).columns

print("Non-numeric columns in training data:", non_numeric_columns_train)
print("Non-numeric columns in test data:", non_numeric_columns_test)

# Coerce those columns to numeric; any value that cannot be parsed becomes NaN
train_data[non_numeric_columns_train] = train_data[non_numeric_columns_train].apply(pd.to_numeric, errors='coerce')
test_data[non_numeric_columns_test] = test_data[non_numeric_columns_test].apply(pd.to_numeric, errors='coerce')

# Impute missing values with each column's mean (each dataset uses its own means)
train_data.fillna(train_data.mean(), inplace=True)
test_data.fillna(test_data.mean(), inplace=True)

# Check for remaining missing values; a column that is entirely NaN has no mean and stays unfilled
missing_values_train_after = train_data.isnull().sum()
missing_values_test_after = test_data.isnull().sum()

print("Missing values in training data after imputation:", missing_values_train_after[missing_values_train_after > 0])
print("Missing values in test data after imputation:", missing_values_test_after[missing_values_test_after > 0])


Non-numeric columns in training data: Index(['f_61', 'f_121', 'f_215', 'f_237'], dtype='object')
Non-numeric columns in test data: Index(['f_61', 'f_121', 'f_215', 'f_237'], dtype='object')
Missing values in training data after imputation: f_61     5000
f_121    5000
f_215    5000
f_237    5000
dtype: int64
Missing values in test data after imputation: f_61     1000
f_121    1000
f_215    1000
f_237    1000
dtype: int64
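
The output above shows the four object columns ending up entirely missing: coercing text to numbers yields NaN in every row, and an all-NaN column has no mean to impute with. The sketch below reproduces that effect on toy frames (`toy_train` and `toy_test`, with made-up values) and shows one possible variation, assumed here rather than taken from the notebook: imputing only the numeric columns and reusing the training means for the test set, so both sets are filled with the same values.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_data and test_data (values are made up)
toy_train = pd.DataFrame({"f_0": [1.0, np.nan, 3.0], "f_61": ["a", "b", "c"]})
toy_test = pd.DataFrame({"f_0": [np.nan, 10.0], "f_61": ["b", "a"]})

# Coercing a column of strings produces NaN in every row, so its mean is NaN
coerced = pd.to_numeric(toy_train["f_61"], errors="coerce")
print(coerced.isna().all())

# Impute numeric columns only, using the training means for both frames
numeric_cols = toy_train.select_dtypes(include="number").columns
train_means = toy_train[numeric_cols].mean()

toy_train_imputed = toy_train.copy()
toy_test_imputed = toy_test.copy()
toy_train_imputed[numeric_cols] = toy_train[numeric_cols].fillna(train_means)
toy_test_imputed[numeric_cols] = toy_test[numeric_cols].fillna(train_means)

print(toy_test_imputed)
```

Leaving the text columns untouched keeps their original values available for a later encoding step.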


In [5]:

# Drop the four columns that could not be converted to numeric and are now entirely NaN
train_data_cleaned = train_data.drop(columns=['f_61', 'f_121', 'f_215', 'f_237'])
test_data_cleaned = test_data.drop(columns=['f_61', 'f_121', 'f_215', 'f_237'])


print("Training data shape after dropping non-numeric columns:", train_data_cleaned.shape)
print("Test data shape after dropping non-numeric columns:", test_data_cleaned.shape)


Training data shape after dropping non-numeric columns: (5000, 251)
Test data shape after dropping non-numeric columns: (1000, 250)
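
Dropping the four columns discards whatever signal they carry. If they hold category labels, one alternative would be one-hot encoding. The sketch below is an assumption-laden illustration: the toy frames and the category values `a`, `b`, `c` are made up, and on the real data it would have to be applied to freshly loaded frames, before the coercion step turns those columns into NaN.

```python
import pandas as pd

# Toy frames with one text column (values are made up)
toy_train = pd.DataFrame({"f_0": [0.1, 0.2, 0.3], "f_61": ["a", "b", "c"]})
toy_test = pd.DataFrame({"f_0": [0.4, 0.5], "f_61": ["b", "a"]})

categorical_cols = ["f_61"]

# One indicator column per category value
train_encoded = pd.get_dummies(toy_train, columns=categorical_cols, dtype=float)
test_encoded = pd.get_dummies(toy_test, columns=categorical_cols, dtype=float)

# Align the test columns to the training layout; categories absent from
# the test set become all-zero columns
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0.0)

print(train_encoded.columns.tolist())
print(test_encoded)
```

Reindexing against the training columns keeps the feature order identical for both sets, which a fitted model expects.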


In [6]:
# Separate features and target variable in the training data
X_train = train_data_cleaned.drop(columns=['target'])
y_train = train_data_cleaned['target']

# Features in the test data (they should match X_train features)
X_test = test_data_cleaned

# Verify the shapes
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)


X_train shape: (5000, 250)
y_train shape: (5000,)
X_test shape: (1000, 250)


In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test dataset
predictions = model.predict(X_test)

# Check the first few predictions
print("First few predictions:", predictions[:10])

# Save predictions to a text file
np.savetxt('predictions.txt', predictions, fmt='%.6f')

First few predictions: [ 7.51465989  1.03470264  1.53740358  6.15150269  6.11193044  3.95086369
 -5.03314059 -4.76487178  5.26160478 -2.05767167]
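
The test targets are withheld, so the predictions above cannot be scored directly. The assignment asks for an estimate of predictive performance, and one common way to get it is k-fold cross-validation on the training set. The sketch below uses synthetic data (`X_toy`, `y_toy`, and the coefficients are made up) so that it runs on its own; on the real data, `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (made-up coefficients and noise level)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
true_coef = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=200)

# scikit-learn maximizes scores, so MSE is exposed as its negative
neg_mse_scores = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to recover the usual (positive) mean squared error
cv_mse = -neg_mse_scores.mean()
print(f"Cross-validated MSE: {cv_mse:.4f}")
```

The same call works with other estimators, which makes it a convenient basis for comparing candidate models on the metric used for grading.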


In [8]:
import random

# two random indices from the test dataset
random_indices = random.sample(range(X_test.shape[0]), 2)

# corresponding rows from the test dataset
random_samples = X_test.iloc[random_indices]

# predictions for these random samples
random_predictions = model.predict(random_samples)

# the indices and predictions
for idx, prediction in zip(random_indices, random_predictions):
    print(f"Index: {idx}, Prediction: {prediction:.6f}")

Index: 517, Prediction: 0.156188
Index: 192, Prediction: 11.089300
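
Before submitting, it can be worth confirming that the saved file has one finite prediction per line, as the assignment requires. The sketch below writes a toy array to a temporary file and reads it back; the array values and the temporary path are assumptions for illustration, and the real check would target `predictions.txt` and expect 1,000 lines.

```python
import os
import tempfile
import numpy as np

# Toy predictions standing in for the real 1,000 values
toy_predictions = np.array([7.514660, 1.034703, -5.033141])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predictions.txt")
    np.savetxt(path, toy_predictions, fmt="%.6f")

    # Read the file back as numbers and count its lines independently
    loaded = np.loadtxt(path, ndmin=1)
    with open(path) as f:
        n_lines = sum(1 for _ in f)

# One line per prediction, and no NaN or infinite values
file_is_valid = (n_lines == toy_predictions.shape[0]) and bool(np.isfinite(loaded).all())
print(n_lines, file_is_valid)
```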
