The goal of this notebook is to build functions that generate recommendations from the data we scraped, and then to choose the function that gives the best recommendations.

In [21]:
import pandas as pd

In [22]:
sim = pd.read_csv('../data_for_notebooks/internshala_recommendation_matrix.csv') 
sim_1 = pd.read_csv('../data_for_notebooks/internshala_recommendation_matrix_wo_tfidf.csv')
sim_2 = pd.read_csv('../data_for_notebooks/internshala_recommendation_matrix_w_lem.csv')
sim_3 = pd.read_csv('../data_for_notebooks/internshala_recommendation_matrix_w_lem_wo_tfidf.csv')
df = pd.read_csv('../data_for_notebooks/internshala_dataset.csv')
sim_cat = pd.read_csv('../data_for_notebooks/internshala_recommendation_df_cat.csv')

Thus we have the following dataframes loaded: 

**sim** : similarity matrix with stemming and tfidf

**sim_1** : similarity matrix made without using tfidf but with stemming

**sim_2** : similarity matrix made with lemmatization and tfidf

**sim_3** : similarity matrix made with lemmatization and without tfidf

**sim_cat** : similarity matrix made using category column and with stemming but without tfidf

**df** : dataframe with all the internships

In [23]:
# add an explicit integer id column (0 .. len(df) - 1) as the first column
df.insert(0, 'id', range(len(df)))
df

Unnamed: 0,id,Title,Company,Location,Duration,Stipend,Apply By,Applicants,Skills Required,Perks,Number of Openings,Link
0,0,Sales & Marketing,Decathlon Sport India Private Limited,Chennai,3 Months,10000 /month,19 Apr' 22,Be an early applicant,,"Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/sale...
1,1,Business Development (Sales),Aditya Birla Capital,Kolkata,3 Months,2000-3000 /month,16 Apr' 22,52 applicants,,"Certificate , Letter of recommendation , Job o...",20,https://internshala.com/internship/detail/busi...
2,2,Teaching (Computer Programming),Nimble-Q,Kanpur Dehat,3 Months,1600 /week,24 Apr' 22,Be an early applicant,"Android , Java , Python ,","Certificate , Letter of recommendation , Job o...",1,https://internshala.com/internship/detail/teac...
3,3,Baking - Culinary,Prianka Khushwani,Delhi,6 Months,4000 /month,24 Apr' 22,Be an early applicant,"Culinary Arts ,","Certificate , Letter of recommendation , Flexi...",2,https://internshala.com/internship/detail/baki...
4,4,Web Development,Spring 360 Technology,Bhopal,3 Months,7000-20000 /month,24 Apr' 22,Be an early applicant,".NET , JavaScript , Node.js , PHP , Python , R...","Certificate , Letter of recommendation , Flexi...",8,https://internshala.com/internship/detail/web-...
...,...,...,...,...,...,...,...,...,...,...,...,...
457,457,Human Resources (HR),Graygraph Technologies LLC,Ghaziabad,2 Months,10000 /month,12 Apr' 22,129 applicants,,"Certificate , Letter of recommendation , Flexi...",5,https://internshala.com/internship/detail/huma...
458,458,Business Development (Sales),Synergies Ananta,Jodhpur,2 Months,4000-9000 /month,15 Apr' 22,Be an early applicant,,"Certificate , Letter of recommendation , Flexi...",10,https://internshala.com/internship/detail/busi...
459,459,Campus Management,Ulead,Work From Home,1 Month,2000 /month,15 Apr' 22,1000+ applicants,"English Proficiency (Spoken) , English Profici...","Certificate , Letter of recommendation , Flexi...",10,https://internshala.com/internship/detail/camp...
460,460,Regional Management,Ulead,Work From Home,1 Month,2000 /month,15 Apr' 22,1000+ applicants,"English Proficiency (Spoken) , English Profici...","Certificate , Letter of recommendation , Flexi...",15,https://internshala.com/internship/detail/regi...


In [24]:
# setting id column as index
sim.set_index('id', inplace = True)
sim_1.set_index('id', inplace = True)
sim_2.set_index('id', inplace = True)
sim_3.set_index('id', inplace = True)
sim_cat.set_index('id', inplace = True)
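
A note on label types, since it matters for the function defined later in this notebook: after `pd.read_csv`, the column labels of a similarity matrix are strings, while the `id` index is parsed as integers. That is why a column has to be looked up with `str(i)`. The snippet below is a minimal sketch on a made-up 3x3 matrix; the CSV text and the name `toy` are illustrative assumptions, not the project's actual files.

```python
import io
import pandas as pd

# Hypothetical CSV with the same layout we assume for the similarity files:
# an 'id' column followed by one column per internship id.
csv_text = "id,0,1,2\n0,3,1,2\n1,1,3,1\n2,2,1,3\n"

toy = pd.read_csv(io.StringIO(csv_text))
toy.set_index('id', inplace=True)

col_labels = toy.columns.tolist()   # column labels come from the header, so they are strings
row_labels = toy.index.tolist()     # the 'id' values are parsed as integers

# Looking up the column for internship 1 therefore needs a string label
col = toy.loc[:, str(1)]

print(col_labels)  # ['0', '1', '2']
print(row_labels)  # [0, 1, 2]
```
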

Making the recommendation function:

In [30]:
def make_recs(sim, df, i, n):
    '''
    returns a dataframe of top n recommendations, based on the similarity matrix provided and dataframe 
    provided, for a user who viewed the internship with the ith ID.
    
    INPUT:
    sim - similarity matrix(dataframe)
    df - original dataframe with all the data
    i - id of the internship that was viewed by the user
    n - top n recommendations to be made to the user 
    
    OUTPUT:
    recs_df - dataframe consisting of the recommended internships
    
    '''
    ith_series = sim.loc[:,str(i)]
    ith_series = ith_series.sort_values(ascending = False)
    # print(ith_series)
    recs = ith_series.head(n+1).index.tolist()
    
    # If many internships tie for the top similarity value, i itself may not appear in the first
    # n+1 entries. Drop i when it is present; otherwise drop the last entry so exactly n ids remain.
    if i in recs: 
        recs.remove(i)
    else:
        recs = recs[:-1]
        
    # Select rows by id in the order given by recs. Filtering with .isin alone would return the
    # rows in df order, i.e. in ascending id order, and lose the ranking.
    recs_df = df.set_index('id').loc[recs].reset_index()
    
    return recs_df
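
To make the ranking step concrete, here is a minimal sketch of the same idea on made-up data. The names `toy_sim`, `toy_df` and `top_n_ids` are hypothetical and exist only for this illustration; the similarity values are invented.

```python
import pandas as pd

# Toy similarity matrix: integer row ids and string column labels,
# mirroring what read_csv + set_index('id') produces in this notebook.
toy_sim = pd.DataFrame(
    [[3, 1, 2, 0],
     [1, 3, 1, 0],
     [2, 1, 3, 1],
     [0, 0, 1, 3]],
    index=pd.Index(range(4), name='id'),
    columns=['0', '1', '2', '3'],
)
toy_df = pd.DataFrame({'id': range(4), 'Title': ['A', 'B', 'C', 'D']})

def top_n_ids(sim, i, n):
    """Return the ids of the n items most similar to item i, excluding i."""
    ranked = sim.loc[:, str(i)].sort_values(ascending=False)
    candidates = ranked.head(n + 1).index.tolist()
    if i in candidates:
        candidates.remove(i)          # usual case: i is most similar to itself
    else:
        candidates = candidates[:-1]  # ties pushed i out: trim to n ids
    return candidates

rec_ids = top_n_ids(toy_sim, 0, 2)

# Index by id and select with .loc so the ranking order is kept
toy_recs = toy_df.set_index('id').loc[rec_ids].reset_index()

print(rec_ids)                     # [2, 1]
print(toy_recs['Title'].tolist())  # ['C', 'B']
```

Item 2 is ranked ahead of item 1 because its similarity to item 0 is higher (2 versus 1), and the output keeps that order instead of falling back to ascending id order.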

Running an example below with the different similarity matrices:

In [31]:
# original internship viewed by the user
df[df.id == 110]

Unnamed: 0,id,Title,Company,Location,Duration,Stipend,Apply By,Applicants,Skills Required,Perks,Number of Openings,Link
110,110,Technical Content Writing,Sarg,Work From Home,1 Month,7000-10000 /month,20 Apr' 22,109 applicants,"Copywriting , Creative Writing , Digital Marke...","Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/tech...


### Recommended internships using different dataframes : 

### for sim :

In [32]:
make_recs(sim, df, 110, 3)

Unnamed: 0,id,Title,Company,Location,Duration,Stipend,Apply By,Applicants,Skills Required,Perks,Number of Openings,Link
0,128,Technical Content Writing,Snowball Innovation Labs,Work From Home,1 Month,7000-10000 /month,19 Apr' 22,114 applicants,"Copywriting , Creative Writing , Digital Marke...","Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/tech...
1,433,Content Creation,Edten,Jaipur,3 Months,2500-5500 /month,13 Apr' 22,123 applicants,"Creative Writing , English Proficiency (Spoken...","Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/cont...
2,332,Subject Matter Expert (SME),PearlySight,Pune,3 Months,15000-25000 /month,13 Apr' 22,Be an early applicant,"English Proficiency (Written) , Social Media M...","Certificate , Letter of recommendation , Flexi...",20,https://internshala.com/internship/detail/subj...


### for sim_2 :

In [34]:
make_recs(sim_2, df, 110, 3)

Unnamed: 0,id,Title,Company,Location,Duration,Stipend,Apply By,Applicants,Skills Required,Perks,Number of Openings,Link
0,128,Technical Content Writing,Snowball Innovation Labs,Work From Home,1 Month,7000-10000 /month,19 Apr' 22,114 applicants,"Copywriting , Creative Writing , Digital Marke...","Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/tech...
1,433,Content Creation,Edten,Jaipur,3 Months,2500-5500 /month,13 Apr' 22,123 applicants,"Creative Writing , English Proficiency (Spoken...","Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/cont...
2,332,Subject Matter Expert (SME),PearlySight,Pune,3 Months,15000-25000 /month,13 Apr' 22,Be an early applicant,"English Proficiency (Written) , Social Media M...","Certificate , Letter of recommendation , Flexi...",20,https://internshala.com/internship/detail/subj...


### for sim_3 :

In [35]:
make_recs(sim_3, df, 110, 3)

Unnamed: 0,id,Title,Company,Location,Duration,Stipend,Apply By,Applicants,Skills Required,Perks,Number of Openings,Link
0,91,Marketing And Communications,Sixth Degree Consulting,Bangalore,2 Months,5000 /month,20 Apr' 22,Be an early applicant,"Creative Writing , English Proficiency (Spoken...","Certificate , Letter of recommendation , Flexi...",3,https://internshala.com/internship/detail/mark...
1,253,International Business Development,Matrix Exports,Chennai,6 Months,5000-10000 /month,16 Apr' 22,30 applicants,"Digital Marketing , Email Marketing , English ...","Certificate , Letter of recommendation , Flexi...",20,https://internshala.com/internship/detail/inte...
2,118,Lead Generation,FreshNames,Work From Home,3 Months,3000 /month,20 Apr' 22,Be an early applicant,"Digital Marketing , Email Marketing , English ...","Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/lead...


### for sim_cat :

In [36]:
make_recs(sim_cat, df, 110, 3)

Unnamed: 0,id,Title,Company,Location,Duration,Stipend,Apply By,Applicants,Skills Required,Perks,Number of Openings,Link
0,128,Technical Content Writing,Snowball Innovation Labs,Work From Home,1 Month,7000-10000 /month,19 Apr' 22,114 applicants,"Copywriting , Creative Writing , Digital Marke...","Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/tech...
1,212,Content Writing,The Blue Owl,Work From Home,3 Months,2500-5000 /month,18 Apr' 22,72 applicants,"Blogging , Creative Writing , English Proficie...","Certificate , Letter of recommendation , Flexi...",2,https://internshala.com/internship/detail/cont...
2,355,Content Writing,Veefin Solutions Private Limited,Work From Home,3 Months,10000 /month,13 Apr' 22,30 applicants,"Blogging , Creative Writing , English Proficie...","Certificate , Letter of recommendation , Flexi...",1,https://internshala.com/internship/detail/cont...


After a lot of mixing and matching and trying out different ids, we decided to take a weighted sum of sim_cat and sim_1, where sim_cat is used as it is and sim_1 is multiplied by 0.3. We do this because the weighted sum gives a good mix of relevant recommendations and serendipitous ones. Many of the recommendations from sim_1 are a little offbeat, which helps keep the recommendations fresh and the user's interest intact. The weights of 1 and 0.3 came from experimentation and from looking at the ranges of the two matrices: the largest value in sim_1 is 520, whereas the largest value in sim_cat is 5.

We took this approach because recommendation systems are a mix of science and art, and this was the art portion: there was no exact way to determine which recommendation system was performing best.
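
As a rough illustration of how such a weighted sum can change a ranking, the sketch below combines two made-up 3x3 matrices with the same weights (0.3 and 1). The names `toy_sim_1`, `toy_sim_cat` and `toy_combined` and all values are assumptions for this example only; pandas adds the two dataframes element-wise after aligning them on their index and column labels.

```python
import pandas as pd

ids = pd.Index(range(3), name='id')
cols = ['0', '1', '2']

# Hypothetical values: the text-based matrix has a wider range than the category matrix
toy_sim_1 = pd.DataFrame([[10, 6, 2], [6, 10, 0], [2, 0, 10]], index=ids, columns=cols)
toy_sim_cat = pd.DataFrame([[5, 0, 3], [0, 5, 0], [3, 0, 5]], index=ids, columns=cols)

# Weighted sum: category similarity at full weight, text similarity scaled down
toy_combined = 0.3 * toy_sim_1 + toy_sim_cat

# Rank the other items for item 0 under each matrix (dropping item 0 itself)
by_text = toy_sim_1.loc[:, '0'].drop(0).sort_values(ascending=False)
by_cat = toy_sim_cat.loc[:, '0'].drop(0).sort_values(ascending=False)
ranked = toy_combined.loc[:, '0'].drop(0).sort_values(ascending=False)

print(by_text.index.tolist())  # [1, 2]
print(by_cat.index.tolist())   # [2, 1]
print(ranked.index.tolist())   # [2, 1]
```

In this toy case the text-based matrix alone would rank item 1 first, but after scaling it by 0.3 the category match for item 2 dominates. Item 1 still keeps a non-zero combined score, which is the kind of offbeat candidate the weighted sum is meant to let through.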

In [41]:
def combine_make_recs(df, i, n):
    '''
    takes the weighted sum of 2 similarity matrices: sim_cat and sim_1. Returns recommendations using the 
    summed up matrix.
    
    INPUT:
    df - original dataframe with all the data
    i - id of the internship that was viewed by the user
    n - top n recommendations to be made to the user 
    
    OUTPUT:
    recs_df - dataframe consisting of the recommended internships
    
    '''
    sim_combined = 0.3*sim_1 + sim_cat
    recs_df = make_recs(sim_combined, df, i, n)
    return recs_df

In [42]:
combine_make_recs(df, 110, 3)

Unnamed: 0,id,Title,Company,Location,Duration,Stipend,Apply By,Applicants,Skills Required,Perks,Number of Openings,Link
0,128,Technical Content Writing,Snowball Innovation Labs,Work From Home,1 Month,7000-10000 /month,19 Apr' 22,114 applicants,"Copywriting , Creative Writing , Digital Marke...","Certificate , Letter of recommendation , Flexi...",4,https://internshala.com/internship/detail/tech...
1,91,Marketing And Communications,Sixth Degree Consulting,Bangalore,2 Months,5000 /month,20 Apr' 22,Be an early applicant,"Creative Writing , English Proficiency (Spoken...","Certificate , Letter of recommendation , Flexi...",3,https://internshala.com/internship/detail/mark...
2,253,International Business Development,Matrix Exports,Chennai,6 Months,5000-10000 /month,16 Apr' 22,30 applicants,"Digital Marketing , Email Marketing , English ...","Certificate , Letter of recommendation , Flexi...",20,https://internshala.com/internship/detail/inte...
