This is a sample notebook for illustration purposes only. We recommend including the cell below with important candidate instructions.
You may need to update the OS and package versions based on the current environment.

### Environment
Ubuntu 22.04 LTS which includes **Python 3.9.12** and utilities *curl*, *git*, *vim*, *unzip*, *wget*, and *zip*. There is no *GPU* support.

The IPython Kernel allows you to execute Python code in the Notebook cell and Python console.

### Installing packages
- Run the `!mamba list "package_name"` command to check the package installation status. For example:

```python
!mamba list numpy
"""
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
numpy                     1.21.6           py39h18676bf_0    conda-forge
"""
```

You can also try importing the package to confirm that it is installed.

- Run the `!mamba install "package_name"` command to install a package.
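The import check mentioned above can be sketched as follows. `is_installed` is a hypothetical helper written for this illustration, not part of mamba or the platform; it only reports whether Python can locate a top-level module by name.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# "json" ships with Python; "numpy" depends on the environment.
for name in ("json", "numpy"):
    print(name, "available" if is_installed(name) else "missing")
```

If a module is reported as missing, installing it with the `!mamba install` command above is the next step.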

### Excluding large files
HackerRank rejects any submission larger than **20MB**. Therefore, you must exclude any large files by adding them to the *.gitignore* file.
You can **Submit** code to validate the status of your submission.
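One way to find candidates for *.gitignore* is to list the files above a size threshold. This is a minimal sketch: `large_files` and `LIMIT_BYTES` are hypothetical names, and reading the limit as 20 × 1024 × 1024 bytes per file is an assumption, since the text above does not say how the size is measured.

```python
from pathlib import Path

LIMIT_BYTES = 20 * 1024 * 1024  # assumed interpretation of the 20MB limit

def large_files(root=".", limit=LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files under root larger than limit, biggest first."""
    sized = [(p, p.stat().st_size) for p in Path(root).rglob("*") if p.is_file()]
    return sorted([(p, s) for p, s in sized if s > limit], key=lambda item: -item[1])
```

Calling `large_files(".")` from the project folder returns the paths worth excluding, for example downloaded model weights or cached embeddings.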

## Introduction

The Occupational Employment and Wage Statistics (OEWS) program produces employment and wage estimates annually for nearly 800 occupations. These estimates are available for the nation as a whole, for individual states, and for metropolitan and nonmetropolitan areas; national occupational estimates for specific industries are also available.

## Problem

The data used in this problem is a subset of the OEWS data, which includes the 10-th percentile, 25-th percentile, 50-th percentile (a.k.a. the median), 75-th percentile, and 90-th percentile of the annual salary for a given combination of state, industry, and occupation.

One needs to use the data in _train.csv_ to train a machine learning model to predict the 10-th, 25-th, 50-th, 75-th and 90-th percentiles of the given combinations in _submission.csv_.

## Data

### Independent Variables

There are three independent variable columns:
- PRIM_STATE
- NAICS_TITLE
- OCC_TITLE

indicating the state, industry, and occupation.

NOTE:
- In the _PRIM_STATE_ variable, each category indicates a state postal abbreviation (like "_CA_", "_TX_", etc.) or "_U.S_" as the whole United States. When _PRIM_STATE_ is "_U.S_", it means the percentiles are aggregated across all the states.
- In the _NAICS_TITLE_ variable, each category indicates an industry sector name (like "_Retail Trade_", "_Manufacturing_") or "_Cross-industry_". When _NAICS_TITLE_ is "_Cross-industry_", it means the percentiles are aggregated across all the industries.

### Target Variables

There are 5 dependent (target) variable columns:
- A_PCT10
- A_PCT25
- A_MEDIAN
- A_PCT75
- A_PCT90

indicating the 10-th percentile, 25-th percentile, median, 75-th percentile, 90-th percentile of the annual base salary given the state, industry, and occupation information.

**IMPORTANT**: the percentiles should follow an increasing order. Namely, the 10-th percentile is less than (<) the 25-th percentile, the 25-th percentile is less than (<) the 50-th percentile, etc.
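The ordering rule can be checked row by row. The sketch below assumes the five percentiles of each row are laid out left to right as P10, P25, median, P75, P90; `is_strictly_increasing` is a hypothetical helper name.

```python
import numpy as np

def is_strictly_increasing(rows) -> np.ndarray:
    """For each row, report whether the values strictly increase from left to right."""
    rows = np.asarray(rows, dtype=float)
    return np.all(np.diff(rows, axis=1) > 0, axis=1)

example = [
    [10000, 30000, 60000, 80000, 100000],  # valid ordering
    [10000, 30000, 30000, 80000, 100000],  # tie between P25 and the median
]
print(is_strictly_increasing(example))  # [ True False]
```

A tie counts as a violation here because the rule is stated with a strict "<".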

## Deliverables

### Submit a Well-Commented Jupyter Notebook

Explore the data, make visualizations, and generate new features if required. Make appropriate plots, annotate the notebook with markdown cells, and explain the necessary inferences. A person should be able to read the notebook and understand the steps taken as well as the reasoning behind them. The solution will be graded on how effectively visualizations are used to convey the analysis and the modeling process.


### Submit _submission.csv_

In the given _submission.csv_, values in the "A_PCT10", "A_PCT25", "A_MEDIAN", "A_PCT75", and "A_PCT90" columns are constants, and you need to replace them with your model predictions.

**IMPORTANT**:
- Please do not change the header given in _submission.csv_; otherwise your predictions may not be evaluated correctly.
- Your Jupyter Notebook should be able to generate your submitted predictions.
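A common way to change the header by accident is to let pandas write the row index as an extra column. The sketch below uses a small stand-in DataFrame (not the real template) and a hypothetical helper, `roundtrip_columns`, to show the effect of `index=False`.

```python
import io
import pandas as pd

# Stand-in for the template; the real file has more columns.
template = pd.DataFrame({"PRIM_STATE": ["US"], "A_PCT10": [10000]})

def roundtrip_columns(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and return the column names read back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

print(roundtrip_columns(template))               # ['Unnamed: 0', 'PRIM_STATE', 'A_PCT10']
print(roundtrip_columns(template, index=False))  # ['PRIM_STATE', 'A_PCT10']
```

Comparing the written header against the template's columns before submitting catches this kind of drift.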



## Evaluation Metric

The model performance is evaluated by the mean normalized weighted absolute error (MNWAE) defined as the following:
$$ MNWAE = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \in \{10, 25, 50, 75, 90\}} w_j \times \frac{|y_{i,j}-z_{i,j}|}{z_{i,j}}$$
where $y_{i,j}$ and $z_{i,j}$ are the model estimation and the ground truth of the $i$-th row and $j$-th percentile, and
$$ w_{10} = w_{90} = 0.1, $$
$$ w_{25} = w_{75} = 0.2, $$
$$ w_{50} = 0.4 $$

For example, if

actual percentiles = [10000, 30000, 60000, 80000, 100000],

predicted percentiles = [11000, 33000, 54000, 88000, 120000],

normalized weighted absolute error = 0.1*|11000-10000|/10000+0.2*|33000-30000|/30000+0.4*|54000-60000|/60000+0.2*|88000-80000|/80000+0.1*|120000-100000|/100000 = 0.11
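The same arithmetic can be reproduced with NumPy. This is only a check of the worked example above for a single row; the variable names are illustrative.

```python
import numpy as np

weights = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # P10, P25, median, P75, P90
actual = np.array([10000, 30000, 60000, 80000, 100000], dtype=float)
predicted = np.array([11000, 33000, 54000, 88000, 120000], dtype=float)

# Relative absolute error per percentile: 0.1, 0.1, 0.1, 0.1, 0.2
relative_error = np.abs(predicted - actual) / actual
row_error = float(np.sum(weights * relative_error))
print(round(row_error, 4))  # 0.11
```

The MNWAE of a full prediction set is the mean of this per-row value over all rows.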

**IMPORTANT**: if the predicted percentiles in any row do not follow an increasing order, all the predictions will be considered as invalid.

## Solution

In [40]:
#import transformers for embedding 

In [94]:
# Import the `pandas` library to load the dataset
import pandas as pd
import matplotlib.pyplot as plt  
import numpy as np  

In [2]:
df_train = pd.read_csv('train.csv')
df_train.shape

(2297, 8)

In [3]:
df_train.head()
df_train.shape

(2297, 8)

In [4]:
# Count the missing values in each column
df_train.isnull().sum()

PRIM_STATE       0
NAICS_TITLE      0
OCC_TITLE        0
A_PCT10          0
A_PCT25          0
A_MEDIAN         0
A_PCT75         30
A_PCT90        139
dtype: int64

In [5]:
# Drop every row that has at least one missing value
df_train = df_train.dropna()
df_train.shape

(2158, 8)

In [6]:
# Confirm that no missing values remain
df_train.isnull().sum()

PRIM_STATE     0
NAICS_TITLE    0
OCC_TITLE      0
A_PCT10        0
A_PCT25        0
A_MEDIAN       0
A_PCT75        0
A_PCT90        0
dtype: int64

In [7]:
# Load a pretrained sentence-embedding model
from sentence_transformers import SentenceTransformer
# Each text is embedded into a 768-dimensional vector
model = SentenceTransformer('bert-base-nli-mean-tokens')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

In [8]:
df_train.columns

Index(['PRIM_STATE', 'NAICS_TITLE', 'OCC_TITLE', 'A_PCT10', 'A_PCT25',
       'A_MEDIAN', 'A_PCT75', 'A_PCT90'],
      dtype='object')

``Embedding the text columns``

In [9]:
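# Pass plain lists of strings to encode(); each result has shape (n_rows, 768)
NAICS_TITLE_Embed = model.encode(df_train['NAICS_TITLE'].tolist())
OCC_TITLE_Embed = model.encode(df_train['OCC_TITLE'].tolist())
PRIM_STATE_Embed = model.encode(df_train['PRIM_STATE'].tolist())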
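Because the three columns are categorical, many rows repeat the same string, so each distinct value only needs to be embedded once. The sketch below is one way to do that; `encode_unique` and `toy_encoder` are hypothetical names, and `toy_encoder` is a stand-in so the example runs without downloading a model. With the notebook's model, `encode_fn` would presumably be `model.encode`.

```python
import numpy as np

def encode_unique(values, encode_fn):
    """Encode each distinct string once, then expand back to one row per input value."""
    values = list(values)
    unique = sorted(set(values))
    vectors = np.asarray(encode_fn(unique))
    position = {text: i for i, text in enumerate(unique)}
    return vectors[[position[v] for v in values]]

# Stand-in encoder for illustration only: maps a string to [length, vowel count].
def toy_encoder(texts):
    return [[len(t), sum(c in "aeiouAEIOU" for c in t)] for t in texts]

print(encode_unique(["CA", "TX", "CA"], toy_encoder))
```

The output has one row per input value, in the original order, so it can replace a direct call on the full column.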

``Creating 768 columns for each of the three text columns``

In [10]:
# Column names: N_* for NAICS_TITLE, O_* for OCC_TITLE, P_* for PRIM_STATE
new_data = pd.DataFrame(
    columns=[f'N_{i}' for i in range(768)]
    + [f'O_{i}' for i in range(768)]
    + [f'P_{i}' for i in range(768)]
)

In [11]:
df_train.shape

(2158, 8)

In [12]:
NAICS_TITLE_Embed.shape

(2158, 768)

``Loading the embeddings into the new DataFrame created above``

In [14]:
# Stack the three embedding matrices side by side; the column order
# matches new_data.columns (N_*, then O_*, then P_*)
new_data = pd.DataFrame(
    np.hstack([NAICS_TITLE_Embed, OCC_TITLE_Embed, PRIM_STATE_Embed]),
    columns=new_data.columns,
)

``Standardising the targets to z-scores``

In [15]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Standardise the five target columns to zero mean and unit variance
df_scale = scaler.fit_transform(df_train[['A_PCT10', 'A_PCT25', 'A_MEDIAN', 'A_PCT75', 'A_PCT90']])
df_scale.shape

(2158, 5)

``Saving the mean and variance so the Y targets can be de-normalised later``

In [17]:
# Values copied from scaler.var_ and scaler.mean_ (printed in the next cell)
var_=[1.12387245e+08, 2.20044299e+08, 4.48678612e+08, 8.45444517e+08,
        1.52372954e+09]
mean_=[32137.2613531 , 40012.56255792, 51940.24096386, 67975.05560704,
        87937.31232623]

In [18]:
scaler.var_,scaler.mean_

(array([1.12387245e+08, 2.20044299e+08, 4.48678612e+08, 8.45444517e+08,
        1.52372954e+09]),
 array([32137.2613531 , 40012.56255792, 51940.24096386, 67975.05560704,
        87937.31232623]))
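The stored mean and variance are what make the targets recoverable later: a standardised value is `z = (y - mean) / sqrt(var)`, so the original is `y = z * sqrt(var) + mean`. The NumPy-only sketch below shows the round trip on made-up numbers (not taken from _train.csv_). It uses the population variance (`ddof=0`), which is what `StandardScaler` computes.

```python
import numpy as np

# Illustrative salaries, not taken from train.csv.
y = np.array([[20000.0, 30000.0], [40000.0, 70000.0], [60000.0, 80000.0]])

mean_ = y.mean(axis=0)
var_ = y.var(axis=0)  # population variance (ddof=0)

z = (y - mean_) / np.sqrt(var_)        # standardised targets
restored = z * np.sqrt(var_) + mean_   # de-normalised back to salaries

print(np.allclose(restored, y))  # True
```

With a fitted scaler at hand, `scaler.inverse_transform` performs the same de-normalisation without copying the numbers by hand.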

``Merging X and the 5 Y targets together in the new_data DataFrame``

``Note: I have not normalised the X data after applying the embedding``

In [19]:
new_data[['A_PCT10', 'A_PCT25','A_MEDIAN', 'A_PCT75', 'A_PCT90']]=df_scale


In [88]:
new_data.head(5)

Unnamed: 0,N_0,N_1,N_2,N_3,N_4,N_5,N_6,N_7,N_8,N_9,...,P_763,P_764,P_765,P_766,P_767,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90
0,0.428084,-0.900451,2.462716,0.794832,0.033432,-0.382973,-1.133532,0.422672,0.126313,-0.616636,...,-0.20639,0.281729,-0.343159,0.250593,-0.158562,0.020067,0.012636,-0.054303,-0.186235,-0.241253
1,0.119501,0.233227,0.515907,0.275239,0.470334,-0.071091,-1.142169,0.681103,0.481496,-0.009242,...,-0.20639,0.281729,-0.343159,0.250593,-0.158562,1.483097,1.455278,1.693405,1.357966,1.68215
2,0.227968,-0.63299,1.029864,-0.022027,0.892578,-0.264391,-1.767998,0.846168,0.171998,-0.618307,...,-0.20639,0.281729,-0.343159,0.250593,-0.158562,2.556552,1.553027,1.78452,2.006599,1.5661
3,-0.105059,0.49814,1.063854,0.221532,0.707098,-0.024127,-0.434689,-0.418044,0.072688,-0.28287,...,-0.20639,0.281729,-0.343159,0.250593,-0.158562,0.488878,0.513516,0.406465,0.325862,0.283148
4,0.410819,0.30781,1.964899,0.087561,-0.291335,1.094691,-0.000369,0.907387,-0.160123,0.023419,...,-0.20639,0.281729,-0.343159,0.250593,-0.158562,1.697222,1.606284,1.408257,1.257198,1.159029


In [21]:
new_data.shape

(2158, 2309)

``KNN solves this problem, although simple linear regression or SVR could also be applied``
``My reasoning is that predicting from the K nearest points in X makes sense here, which is why I apply KNN``
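To make the idea concrete, here is a minimal NumPy sketch of multi-output KNN regression: each query row is predicted as the plain average of the targets of its `k` nearest training rows under Euclidean distance. `knn_predict` is a hypothetical illustration, not the scikit-learn implementation, although uniform averaging and Euclidean distance are also the defaults of `KNeighborsRegressor`.

```python
import numpy as np

def knn_predict(X_train, Y_train, X_query, k=5):
    """Predict each query row as the mean target of its k nearest training rows."""
    X_train, Y_train, X_query = (
        np.asarray(a, dtype=float) for a in (X_train, Y_train, X_query)
    )
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest rows
    return Y_train[nearest].mean(axis=1)      # average their target vectors

X = [[0.0], [1.0], [10.0], [11.0]]
Y = [[1.0, 2.0], [3.0, 4.0], [20.0, 30.0], [40.0, 50.0]]
print(knn_predict(X, Y, [[0.4], [10.6]], k=2))
```

One consequence of averaging: if every neighbour's percentiles are in increasing order, their average is too, so the prediction inherits the ordering from the training rows.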

In [22]:
# Apply KNN to predict the salary percentiles
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
knn = KNeighborsRegressor(n_neighbors=5)

X = new_data.iloc[:, 0:2304]     # 3 x 768 embedding features (not normalised)
y = new_data.iloc[:, 2304:2309]  # 5 standardised target columns
# Hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



In [23]:
var_,mean_

([112387245.0, 220044299.0, 448678612.0, 845444517.0, 1523729540.0],
 [32137.2613531,
  40012.56255792,
  51940.24096386,
  67975.05560704,
  87937.31232623])

``Converting the mean and variance to NumPy arrays``

In [24]:
import numpy as np  
var_=np.array(var_)
mean_=np.array(mean_)


In [28]:
X_test.head(5)

Unnamed: 0,N_0,N_1,N_2,N_3,N_4,N_5,N_6,N_7,N_8,N_9,...,P_758,P_759,P_760,P_761,P_762,P_763,P_764,P_765,P_766,P_767
997,0.14933,-0.023193,1.604413,0.119275,-0.242201,0.636822,-0.668447,0.997238,0.115016,-0.47612,...,0.365274,-0.149756,0.499679,0.48662,-0.522635,-0.198946,0.172999,0.18923,0.063899,0.271388
361,0.14933,-0.023193,1.604413,0.119275,-0.242201,0.636822,-0.668447,0.997238,0.115016,-0.47612,...,0.814244,-0.010145,0.42115,1.049751,-0.283488,-0.365852,0.480489,-0.176223,0.159697,0.314224
416,-0.098339,-0.543112,2.475527,0.835461,0.666772,0.021136,-0.889064,0.834138,0.172423,-0.10684,...,0.433737,-0.588564,0.599435,0.574548,-0.705447,-0.20639,0.281729,-0.343159,0.250593,-0.158562
1112,0.14933,-0.023193,1.604413,0.119275,-0.242201,0.636822,-0.668447,0.997238,0.115016,-0.47612,...,0.88669,-0.160474,0.45411,0.34845,-0.317521,0.063104,-0.067833,0.014125,-0.000544,0.175693
485,0.14933,-0.023193,1.604413,0.119275,-0.242201,0.636822,-0.668447,0.997238,0.115016,-0.47612,...,0.88669,-0.160474,0.45411,0.34845,-0.317521,0.063104,-0.067833,0.014125,-0.000544,0.175693


``Fitting KNN on the training split and predicting on the validation split from train_test_split``

In [73]:
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# De-normalise the predictions (z * std + mean) and round to 2 decimals
y_pred_out = np.round(y_pred * np.sqrt(var_) + mean_, 2)

``Mean normalized weighted absolute error (MNWAE)``

In [92]:
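def MNWAE(y_true, y_pred):
    """Mean normalized weighted absolute error over the five percentile columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Relative absolute error for every row and percentile
    normal = np.abs(y_true - y_pred) / y_true
    # Weights for the 10-th, 25-th, 50-th, 75-th and 90-th percentiles
    w = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
    return np.sum(normal * w) / len(y_true)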

``De-normalising the targets back to their actual values``

In [76]:
y_test_de_scale=(y_test*np.power(var_,0.5) + mean_)
y_train_de_scale=(y_train*np.power(var_,0.5) + mean_)
y_test_de_scale.head(5)

| | A_PCT10 | A_PCT25 | A_MEDIAN | A_PCT75 | A_PCT90 |
|---|---|---|---|---|---|
| 997 | 29289.999995 | 37079.999998 | 47430.0 | 61210.0 | 78549.999992 |
| 361 | 48420.000027 | 68840.000016 | 98809.999996 | 127829.999996 | 168290.000072 |
| 416 | 30069.999997 | 31309.999995 | 49010.0 | 60220.000001 | 75409.999989 |
| 1112 | 22539.999984 | 35549.999998 | 48260.0 | 61350.0 | 77629.999991 |
| 485 | 23259.999985 | 28079.999994 | 34600.000002 | 46640.000002 | 60379.999975 |


``MNWAE between the actual and predicted values after applying KNN``

In [95]:
MNWAE(y_test_de_scale.values,y_pred_out)

0.1991487949345564
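A single score hides which percentile drives the error. As a sketch, the metric can be split into one weighted contribution per percentile; the contributions add up to the MNWAE. `mnwae_by_percentile` is a hypothetical helper, and the numbers below are the worked example from the problem statement, not validation results.

```python
import numpy as np

WEIGHTS = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # P10, P25, median, P75, P90

def mnwae_by_percentile(y_true, y_pred):
    """Return each percentile's weighted contribution; the entries sum to the MNWAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relative = np.abs(y_pred - y_true) / y_true
    return WEIGHTS * relative.mean(axis=0)

truth = [[10000, 30000, 60000, 80000, 100000]]
guess = [[11000, 33000, 54000, 88000, 120000]]
parts = mnwae_by_percentile(truth, guess)
print(parts.round(4), round(float(parts.sum()), 4))
```

Applied to the validation split, a breakdown like this would show whether the tails or the median account for most of the error.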

In [96]:
var_=[1.12387245e+08, 2.20044299e+08, 4.48678612e+08, 8.45444517e+08,
        1.52372954e+09]
mean_=[32137.2613531 , 40012.56255792, 51940.24096386, 67975.05560704,
        87937.31232623]

In [97]:
sub=pd.read_csv('./submission.csv')
sub.head()

| | PRIM_STATE | NAICS_TITLE | OCC_TITLE | A_PCT10 | A_PCT25 | A_MEDIAN | A_PCT75 | A_PCT90 |
|---|---|---|---|---|---|---|---|---|
| 0 | US | Accommodation and Food Services | Other Production Occupations | 10000 | 20000 | 30000 | 40000 | 50000 |
| 1 | NE | Cross-industry | Arts, Design, Entertainment, Sports, and Media... | 10000 | 20000 | 30000 | 40000 | 50000 |
| 2 | US | Manufacturing | Construction and Extraction Occupations | 10000 | 20000 | 30000 | 40000 | 50000 |
| 3 | US | Wholesale Trade | Material Moving Workers | 10000 | 20000 | 30000 | 40000 | 50000 |
| 4 | US | Other Services (except Public Administration) | Supervisors of Building and Grounds Cleaning a... | 10000 | 20000 | 30000 | 40000 | 50000 |


In [98]:
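# Embed the submission rows with the same model, passing plain lists of strings
NAICS_TITLE_Embed_sub = model.encode(sub['NAICS_TITLE'].tolist())
OCC_TITLE_Embed_sub = model.encode(sub['OCC_TITLE'].tolist())
PRIM_STATE_Embed_sub = model.encode(sub['PRIM_STATE'].tolist())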

In [36]:
# Same column layout as the training features: N_*, then O_*, then P_*
feature_columns = (
    [f'N_{i}' for i in range(768)]
    + [f'O_{i}' for i in range(768)]
    + [f'P_{i}' for i in range(768)]
)
new_data_sub = pd.DataFrame(
    np.hstack([NAICS_TITLE_Embed_sub, OCC_TITLE_Embed_sub, PRIM_STATE_Embed_sub]),
    columns=feature_columns,
)

In [51]:
# Predict standardised percentiles for the submission rows
y_pred_sub = knn.predict(new_data_sub)
# De-normalise the predictions (z * std + mean) and round to 2 decimals
y_sub = np.round(y_pred_sub * np.sqrt(var_) + mean_, 2)


In [55]:
A_PCT10=[]
A_PCT25=[]
A_MEDIAN=[]
A_PCT75=[]
A_PCT90=[]
for i in  y_sub:
    A_PCT10.append(i[0])
    A_PCT25.append(i[1])
    A_MEDIAN.append(i[2])
    A_PCT75.append(i[3])
    A_PCT90.append(i[4])

In [60]:
sub['A_PCT10']=A_PCT10
sub['A_PCT25']=A_PCT25
sub['A_MEDIAN']=A_MEDIAN
sub['A_PCT75']=A_PCT75
sub['A_PCT90']=A_PCT90

In [99]:
# index=False keeps the header identical to the one in the given submission.csv
sub.to_csv('./submission1.csv', index=False)
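The notebook does not verify the ordering requirement before saving. One possible safeguard, sketched below, is to sort each predicted row and nudge any ties apart. This post-processing step is an addition for illustration, not part of the original solution; `enforce_increasing` and the `min_gap` of 0.01 are assumptions.

```python
import numpy as np

def enforce_increasing(pred, min_gap=0.01):
    """Sort each row, then push ties apart so each value exceeds the previous by at least min_gap."""
    out = np.sort(np.asarray(pred, dtype=float), axis=1)
    for j in range(1, out.shape[1]):
        out[:, j] = np.maximum(out[:, j], out[:, j - 1] + min_gap)
    return out

# Made-up row with one inversion and one tie.
raw = [[30000.0, 30000.0, 29000.0, 50000.0, 60000.0]]
print(enforce_increasing(raw))
```

Rows that are already strictly increasing pass through unchanged, so the step would only alter predictions that would otherwise be invalid.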