
# Data Setup and Exploration

### **Setup**

**Import Required Libraries**

In [1]:
import glob
import gzip
import os
import pandas as pd
import requests
import zipfile

from google.colab import drive, files

**Load Datasets into Local Colab Storage**

The data is sourced from CareerBuilder.com's [Job Recommendation Challenge](https://www.kaggle.com/c/job-recommendation/data), hosted on Kaggle in 2012.

*Technical Notes:*
 
*   Because the files are large, this cell can take up to about five minutes to run.

*   Colab's local storage does not persist between sessions, so this cell must be rerun at the start of each new session.

In [2]:
# If data files are not already in local storage
if not os.path.isdir("data"):

    # Retrieve zip file from Dropbox and write to base/default folder
    r = requests.get("https://www.dropbox.com/s/v2fdobitjrjieku/data.zip?dl=1")
    r.raise_for_status()  # Fail early if the download did not succeed
    with open("data.zip", 'wb') as f:
        f.write(r.content)

    # Extract zip file contents to create local data folder with .tsv.gz files
    with zipfile.ZipFile("data.zip", 'r') as zip_ref:
        zip_ref.extractall(".")

    # For each gzip-compressed file extracted from the zip archive
    for path in glob.glob("data/*.tsv.gz"):

        # Create destination file path by dropping the ".gz" suffix
        dest_path = f'data/{os.path.basename(path)[:-3]}'

        # Open compressed file for reading and destination file for writing
        with open(path, 'rb') as f, open(dest_path, 'wb') as g:

            # Decompress the gzip data and write it to the destination
            decompressed = gzip.decompress(f.read())
            g.write(decompressed)

        # Delete original compressed file
        os.remove(path)

    # Delete zip file
    os.remove("data.zip")
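The cell above reads each compressed file fully into memory before writing it out. A minimal sketch of a streaming alternative follows; the helper name `decompress_gz` and the small demo file are hypothetical and not part of the original notebook.

```python
import gzip
import os
import shutil
import tempfile


def decompress_gz(path):
    """Stream-decompress `path` (ending in .gz) next to itself and return the new path."""
    dest_path = path[:-3]  # Drop the ".gz" suffix
    with gzip.open(path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # Copies in chunks rather than all at once
    return dest_path


# Small self-contained demo on a temporary file (hypothetical data)
demo_dir = tempfile.mkdtemp()
demo_gz = os.path.join(demo_dir, "demo.tsv.gz")
with gzip.open(demo_gz, "wb") as f:
    f.write(b"UserID\tJobID\n47\t169528\n")

demo_tsv = decompress_gz(demo_gz)
with open(demo_tsv, "rb") as f:
    demo_content = f.read()
```

In the notebook, the same helper could be called on each path returned by `glob.glob("data/*.tsv.gz")`.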

### **Exploration**

**Users**

Potentially disadvantaged groups to examine:


*   Users who have a high-school diploma or less
*   Users based in zip codes associated with lower incomes/mobility
*   Users whose graduation date would put them in an older age bracket

*users.tsv - Holds all users and their metadata*

In [3]:
# File Preview
users = pd.read_csv("data/users.tsv", sep="\t")
users.head(5)

Unnamed: 0,UserID,WindowID,Split,City,State,Country,ZipCode,DegreeType,Major,GraduationDate,WorkHistoryCount,TotalYearsExperience,CurrentlyEmployed,ManagedOthers,ManagedHowMany
0,47,1,Train,Paramount,CA,US,90723,High School,,1999-06-01 00:00:00,3,10.0,Yes,No,0
1,72,1,Train,La Mesa,CA,US,91941,Master's,Anthropology,2011-01-01 00:00:00,10,8.0,Yes,No,0
2,80,1,Train,Williamstown,NJ,US,8094,High School,Not Applicable,1985-06-01 00:00:00,5,11.0,Yes,Yes,5
3,98,1,Train,Astoria,NY,US,11105,Master's,Journalism,2007-05-01 00:00:00,3,3.0,Yes,No,0
4,123,1,Train,Baton Rouge,LA,US,70808,Bachelor's,Agricultural Business,2011-05-01 00:00:00,1,9.0,Yes,No,0


In [4]:
# Degree type counts
users["DegreeType"].value_counts().to_frame()

Unnamed: 0,DegreeType
Bachelor's,104210
,100153
High School,93305
Associate's,45786
Master's,35330
Vocational,6981
PhD,3943


In [5]:
# Total number of users in dataset
len(users)

389708
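The groups listed at the start of this section could be flagged from the `DegreeType` and `GraduationDate` columns previewed above. The sketch below is illustrative only: the function name `flag_groups`, the output column names, and the 1990 cutoff are assumptions, since the notebook does not define what counts as an older age bracket. Users with a blank `DegreeType` are left unflagged here because their meaning is not stated.

```python
import pandas as pd

# Hypothetical cutoff: the notebook does not define "older age bracket"
GRADUATION_CUTOFF = pd.Timestamp("1990-01-01")


def flag_groups(users):
    """Return a copy of `users` with two illustrative boolean flag columns."""
    out = users.copy()
    # Users whose highest listed degree is a high-school diploma
    out["HighSchoolOrLess"] = out["DegreeType"].eq("High School")
    # Users who graduated before the (assumed) cutoff; unparseable dates become NaT
    graduation = pd.to_datetime(out["GraduationDate"], errors="coerce")
    out["EarlyGraduate"] = graduation < GRADUATION_CUTOFF
    return out


# Three rows copied from the users preview above
sample_users = pd.DataFrame({
    "UserID": [47, 72, 80],
    "DegreeType": ["High School", "Master's", "High School"],
    "GraduationDate": ["1999-06-01 00:00:00", "2011-01-01 00:00:00", "1985-06-01 00:00:00"],
})
flagged = flag_groups(sample_users)
```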

*user_history.tsv - Holds users' past job title(s)*

In [6]:
# File preview
user_history = pd.read_csv("data/user_history.tsv", sep="\t")
user_history.head(5)

Unnamed: 0,UserID,WindowID,Split,Sequence,JobTitle
0,47,1,Train,1,National Space Communication Programs-Special ...
1,47,1,Train,2,Detention Officer
2,47,1,Train,3,"Passenger Screener, TSA"
3,72,1,Train,1,"Lecturer, Department of Anthropology"
4,72,1,Train,2,Student Assistant


In [7]:
# Example job titles for a single user (UserID 47, the first user in the file)
list(user_history.query("UserID == 47")["JobTitle"])

['National Space Communication Programs-Special Program Supervisor',
 'Detention Officer',
 'Passenger Screener, TSA']

**Jobs**

*jobs.tsv: Holds the jobs available on CareerBuilder.com during a 13-day window*

In [8]:
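# File preview for jobs listed in the first of the seven windows
# Note: One line of this file is malformed (12 fields instead of 11); it is skipped
# with a warning here and should eventually be corrected
# on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0 (requires pandas >= 1.3)
jobs1 = pd.read_csv("data/jobs1.tsv", sep="\t", on_bad_lines="warn")
jobs1.head(5)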

b'Skipping line 122433: expected 11 fields, saw 12\n'


Unnamed: 0,JobID,WindowID,Title,Description,Requirements,City,State,Country,Zip5,StartDate,EndDate
0,1,1,Security Engineer/Technical Lead,<p>Security Clearance Required:&nbsp; Top Secr...,<p>SKILL SET</p>\r<p>&nbsp;</p>\r<p>Network Se...,Washington,DC,US,20531.0,2012-03-07 13:17:01.643,2012-04-06 23:59:59
1,4,1,SAP Business Analyst / WM,<strong>NO Corp. to Corp resumes&nbsp;are bein...,<p><b>WHAT YOU NEED: </b></p>\r<p>Four year co...,Charlotte,NC,US,28217.0,2012-03-21 02:03:44.137,2012-04-20 23:59:59
2,7,1,P/T HUMAN RESOURCES ASSISTANT,<b> <b> P/T HUMAN RESOURCES ASSISTANT</b> <...,Please refer to the Job Description to view th...,Winter Park,FL,US,32792.0,2012-03-02 16:36:55.447,2012-04-01 23:59:59
3,8,1,Route Delivery Drivers,CITY BEVERAGES Come to work for the best in th...,Please refer to the Job Description to view th...,Orlando,FL,US,,2012-03-03 09:01:10.077,2012-04-02 23:59:59
4,9,1,Housekeeping,I make sure every part of their day is magica...,Please refer to the Job Description to view th...,Orlando,FL,US,,2012-03-03 09:01:11.88,2012-04-02 23:59:59
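Before correcting the malformed line noted above, it helps to locate it and see how many fields it has. The sketch below uses only the standard library; the helper name `find_bad_lines` and the demo file are hypothetical, and the `csv` module's tokenization may differ in edge cases from the pandas parser.

```python
import csv
import os
import tempfile


def find_bad_lines(path, expected_fields, delimiter="\t"):
    """Return (line_number, field_count) for each row with an unexpected field count."""
    bad = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        for row in reader:
            if len(row) != expected_fields:
                # line_num counts physical lines read so far
                bad.append((reader.line_num, len(row)))
    return bad


# Small demo file with one row that has an extra field (hypothetical data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_jobs.tsv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    f.write("JobID\tTitle\tCity\n")
    f.write("1\tSecurity Engineer\tWashington\n")
    f.write("4\tSAP Business Analyst\tWM\tCharlotte\n")

bad_lines = find_bad_lines(demo_path, expected_fields=3)
```

For the real file, the call would be something like `find_bad_lines("data/jobs1.tsv", expected_fields=11)`.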


In [9]:
# Number of jobs in first window
len(jobs1)

285091

**Apps**

*apps.tsv: Holds the applications users submitted*

In [10]:
# File preview
apps = pd.read_csv("data/apps.tsv", sep="\t")
apps.head(5)

Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID
0,47,1,Train,2012-04-04 15:56:23.537,169528
1,47,1,Train,2012-04-06 01:03:00.003,284009
2,47,1,Train,2012-04-05 02:40:27.753,2121
3,47,1,Train,2012-04-05 02:37:02.673,848187
4,47,1,Train,2012-04-05 22:44:06.653,733748


In [11]:
# Look at the variety of job titles in users' work histories
user_history['JobTitle'].value_counts()

Customer Service Representative                                      19672
Cashier                                                              16368
Administrative Assistant                                             16228
Sales Associate                                                      15645
Assistant Manager                                                    11712
                                                                     ...  
Accounting/HR and Office Manager/Administrator                           1
Forklift/Clamp/Slip sheet operator                                       1
Answered Phones, Organized & Updated Database, and General Office        1
Cake Decorator/ Shift Leader                                             1
Specialist - Short Sale / Deed in Lieu Department                        1
Name: JobTitle, Length: 657155, dtype: int64

In [13]:
# Look at the variety of titles among the jobs listed in the first window
jobs1['Title'].value_counts()

Own Your Own Franchise!                                          1764
Customer Service Representative                                  1152
Administrative Assistant                                         1141
Sales / Franchise                                                1050
Account Representative                                           1008
                                                                 ... 
CLERK  Associate's degree or higher with coursework in              1
Software Developer - Lotus Notes (TS/SCI FS POLY Clearance)         1
Direct Care Counselor -  Mental Health (Part-Time Weeknights)       1
Sr User Experience Designer                                         1
Business Analyst II - Consumer - Marsh - Urbandale, IA              1
Name: Title, Length: 136231, dtype: int64

In [None]:
# Distribution of the number of applications submitted per user
apps["UserID"].value_counts().describe().to_frame().rename(columns={"UserID": "App Submissions"})

Unnamed: 0,App Submissions
count,321235.0
mean,4.990462
std,11.418487
min,1.0
25%,1.0
50%,2.0
75%,5.0
max,2473.0


**Window Dates**

*window_dates.tsv: Holds the application window dates*

In [14]:
# File preview
window_dates = pd.read_csv("data/window_dates.tsv", sep="\t")
window_dates.head(5)

Unnamed: 0,Window,Train Start,Train End / Test Start,Test End
0,1,2012-04-01 00:00:00,2012-04-10 00:00:00,2012-04-14 00:00:00
1,2,2012-04-14 00:00:00,2012-04-23 00:00:00,2012-04-27 00:00:00
2,3,2012-04-27 00:00:00,2012-05-06 00:00:00,2012-05-10 00:00:00
3,4,2012-05-10 00:00:00,2012-05-19 00:00:00,2012-05-23 00:00:00
4,5,2012-05-23 00:00:00,2012-06-01 00:00:00,2012-06-05 00:00:00
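The window table can be checked against the 13-day figure quoted in the Jobs section by converting the date columns and subtracting. This sketch rebuilds the first two rows of the preview above by hand; the derived column names (`TrainDays`, `TestDays`, `TotalDays`) are illustrative.

```python
import pandas as pd

# First two rows copied from the window_dates preview above
windows = pd.DataFrame({
    "Window": [1, 2],
    "Train Start": ["2012-04-01 00:00:00", "2012-04-14 00:00:00"],
    "Train End / Test Start": ["2012-04-10 00:00:00", "2012-04-23 00:00:00"],
    "Test End": ["2012-04-14 00:00:00", "2012-04-27 00:00:00"],
})

# Parse the three date columns into datetimes
for col in ["Train Start", "Train End / Test Start", "Test End"]:
    windows[col] = pd.to_datetime(windows[col])

# Length of each phase in whole days
windows["TrainDays"] = (windows["Train End / Test Start"] - windows["Train Start"]).dt.days
windows["TestDays"] = (windows["Test End"] - windows["Train End / Test Start"]).dt.days
windows["TotalDays"] = windows["TrainDays"] + windows["TestDays"]
```

For these two rows, each window splits into a 9-day training period and a 4-day test period.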


In [None]:
# Count the number of applications each job received
apps["JobID"].value_counts()

17361      208
900797     203
67239      189
601021     188
601126     186
          ... 
1083440      1
1067048      1
1060903      1
1114113      1
2049         1
Name: JobID, Length: 365668, dtype: int64

In [None]:
# Note: There are 365,668 unique job IDs, but even the most frequently applied-to job received only 208 applications
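The same point can be expressed as the density of the user-job interaction matrix. The user and job counts below come from the outputs above. The total number of applications is an assumption inferred from the mean of about 4.99 applications across 321,235 users, and it ignores any duplicate user-job pairs.

```python
# Counts taken from the notebook outputs above
n_users = 321_235   # users with at least one application
n_jobs = 365_668    # unique job IDs in apps

# Assumption: total applications, inferred from the mean applications per user
n_apps = 1_603_111

# Fraction of all possible user-job pairs that have an application
possible_pairs = n_users * n_jobs
density = n_apps / possible_pairs

print(f"Possible user-job pairs: {possible_pairs:,}")
print(f"Density of interactions: {density:.2e}")
```

A density on the order of 1e-5 suggests that a dense one-hot matrix of the full data would be almost entirely zeros.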

In [15]:
# Number of applications submitted by each user
apps.groupby("UserID")["JobID"].count().to_frame()

UserID,JobID
7,2
9,3
13,1
14,6
16,2
...,...
1472079,5
1472085,1
1472089,12
1472090,2


In [16]:
# Keep at most the first 100 applications for each user
apps.groupby("UserID")["JobID"].head(100)

0           169528
1           284009
2             2121
3           848187
4           733748
            ...   
1603106     573732
1603107      39401
1603108     175198
1603109    1073263
1603110     646949
Name: JobID, Length: 1576348, dtype: int64

In [None]:
# Unique user IDs (rows) and job IDs (columns) for the user-job interaction matrix
user_id_list = apps.UserID.unique()
job_id_list = apps.JobID.unique()

In [17]:
# Work with only the first 100 applications to keep the one-hot encoding small
apps2 = apps.head(100)
apps2

Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID
0,47,1,Train,2012-04-04 15:56:23.537,169528
1,47,1,Train,2012-04-06 01:03:00.003,284009
2,47,1,Train,2012-04-05 02:40:27.753,2121
3,47,1,Train,2012-04-05 02:37:02.673,848187
4,47,1,Train,2012-04-05 22:44:06.653,733748
...,...,...,...,...,...
95,554,1,Train,2012-04-02 05:08:41.31,855139
96,554,1,Train,2012-04-02 05:08:44.563,149199
97,554,1,Train,2012-04-02 05:08:41.717,449029
98,554,1,Train,2012-04-02 10:59:19.397,627377


In [18]:
# One-hot encode JobID, then collapse to one row per user (1 = user applied to that job)
df = pd.get_dummies(apps2, columns=['JobID']).groupby('UserID', as_index=False).max()

In [19]:
df.head()

Unnamed: 0,UserID,WindowID,Split,ApplicationDate,JobID_18,JobID_40,JobID_957,JobID_2121,JobID_13175,JobID_13298,JobID_25820,JobID_28139,JobID_35071,JobID_50183,JobID_69667,JobID_135040,JobID_142459,JobID_146817,JobID_148976,JobID_149199,JobID_153632,JobID_169528,JobID_180313,JobID_196603,JobID_251966,JobID_262470,JobID_271546,JobID_280275,JobID_283948,JobID_283949,JobID_284009,JobID_300007,JobID_300020,JobID_300053,JobID_314495,JobID_316374,JobID_328100,JobID_336293,JobID_366888,JobID_381246,...,JobID_654538,JobID_680718,JobID_688863,JobID_717481,JobID_733748,JobID_747584,JobID_752100,JobID_758079,JobID_766183,JobID_784093,JobID_802921,JobID_811833,JobID_812337,JobID_817048,JobID_822835,JobID_834662,JobID_848187,JobID_855139,JobID_871031,JobID_871066,JobID_908909,JobID_920491,JobID_932921,JobID_946506,JobID_1008042,JobID_1008052,JobID_1020903,JobID_1032422,JobID_1042648,JobID_1066757,JobID_1075341,JobID_1078274,JobID_1080147,JobID_1080148,JobID_1091388,JobID_1091719,JobID_1092900,JobID_1098779,JobID_1102826,JobID_1113088
0,47,1,Train,2012-04-06 01:03:00.003,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,72,1,Train,2012-04-30 20:05:15.293,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,80,1,Train,2012-04-04 10:53:19.847,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,123,1,Train,2012-04-02 21:03:45.093,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
4,131,1,Train,2012-04-05 17:09:34.33,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [20]:
df = df.drop(['WindowID', 'Split', 'ApplicationDate'], axis=1)

In [21]:
df.head()

Unnamed: 0,UserID,JobID_18,JobID_40,JobID_957,JobID_2121,JobID_13175,JobID_13298,JobID_25820,JobID_28139,JobID_35071,JobID_50183,JobID_69667,JobID_135040,JobID_142459,JobID_146817,JobID_148976,JobID_149199,JobID_153632,JobID_169528,JobID_180313,JobID_196603,JobID_251966,JobID_262470,JobID_271546,JobID_280275,JobID_283948,JobID_283949,JobID_284009,JobID_300007,JobID_300020,JobID_300053,JobID_314495,JobID_316374,JobID_328100,JobID_336293,JobID_366888,JobID_381246,JobID_403896,JobID_449029,JobID_449169,...,JobID_654538,JobID_680718,JobID_688863,JobID_717481,JobID_733748,JobID_747584,JobID_752100,JobID_758079,JobID_766183,JobID_784093,JobID_802921,JobID_811833,JobID_812337,JobID_817048,JobID_822835,JobID_834662,JobID_848187,JobID_855139,JobID_871031,JobID_871066,JobID_908909,JobID_920491,JobID_932921,JobID_946506,JobID_1008042,JobID_1008052,JobID_1020903,JobID_1032422,JobID_1042648,JobID_1066757,JobID_1075341,JobID_1078274,JobID_1080147,JobID_1080148,JobID_1091388,JobID_1091719,JobID_1092900,JobID_1098779,JobID_1102826,JobID_1113088
0,47,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,72,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,123,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
4,131,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [22]:
df = df.set_index('UserID')
df.head()

UserID,JobID_18,JobID_40,JobID_957,JobID_2121,JobID_13175,JobID_13298,JobID_25820,JobID_28139,JobID_35071,JobID_50183,JobID_69667,JobID_135040,JobID_142459,JobID_146817,JobID_148976,JobID_149199,JobID_153632,JobID_169528,JobID_180313,JobID_196603,JobID_251966,JobID_262470,JobID_271546,JobID_280275,JobID_283948,JobID_283949,JobID_284009,JobID_300007,JobID_300020,JobID_300053,JobID_314495,JobID_316374,JobID_328100,JobID_336293,JobID_366888,JobID_381246,JobID_403896,JobID_449029,JobID_449169,JobID_473911,...,JobID_654538,JobID_680718,JobID_688863,JobID_717481,JobID_733748,JobID_747584,JobID_752100,JobID_758079,JobID_766183,JobID_784093,JobID_802921,JobID_811833,JobID_812337,JobID_817048,JobID_822835,JobID_834662,JobID_848187,JobID_855139,JobID_871031,JobID_871066,JobID_908909,JobID_920491,JobID_932921,JobID_946506,JobID_1008042,JobID_1008052,JobID_1020903,JobID_1032422,JobID_1042648,JobID_1066757,JobID_1075341,JobID_1078274,JobID_1080147,JobID_1080148,JobID_1091388,JobID_1091719,JobID_1092900,JobID_1098779,JobID_1102826,JobID_1113088
47,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
72,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
123,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
131,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [23]:
#Convert dataframe to numpy array
import numpy as np
matrix = df.to_numpy()
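The one-hot approach above yields a dense array, which is manageable for 100 applications but probably not for the full `apps` table. One possible alternative is to build a sparse user-job matrix directly from the ID columns. The function name `build_interaction_matrix` and the demo rows are hypothetical; this is a sketch, not the notebook's method.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def build_interaction_matrix(apps):
    """Return a binary sparse user-by-job matrix plus the row and column ID labels."""
    # Map each ID to a consecutive integer code; uniques are sorted
    user_codes, user_ids = pd.factorize(apps["UserID"], sort=True)
    job_codes, job_ids = pd.factorize(apps["JobID"], sort=True)

    values = np.ones(len(apps), dtype=np.uint8)
    interactions = csr_matrix(
        (values, (user_codes, job_codes)),
        shape=(len(user_ids), len(job_ids)),
    )
    # Repeated (user, job) pairs are summed on construction; clip back to 1
    interactions.data[:] = 1
    return interactions, user_ids, job_ids


# Tiny demo with one repeated application (hypothetical rows)
demo_apps = pd.DataFrame({
    "UserID": [47, 47, 47, 72],
    "JobID": [169528, 2121, 2121, 2121],
})
interactions, user_ids, job_ids = build_interaction_matrix(demo_apps)
```

Calling `interactions.toarray()` gives the dense equivalent for small inputs.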

In [24]:
from scipy.sparse.linalg import svds
#matrix = matrix.astype(float)
#U, sigma, Vt = svds(matrix, k = 5)

In [None]:
#sigma = np.diag(sigma)
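The commented-out lines above outline a truncated SVD. A runnable sketch on a random binary matrix of the same shape as `matrix` (19 x 99) is shown below; the demo data and seed are assumptions, and `k` must be smaller than both matrix dimensions.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Random binary stand-in for the 19 x 99 interaction matrix (hypothetical data)
rng = np.random.default_rng(0)
demo_matrix = (rng.random((19, 99)) < 0.05).astype(float)  # svds needs a float dtype

# Keep the 5 largest singular values and their vectors
U, sigma, Vt = svds(demo_matrix, k=5)

# Rank-5 approximation of the original matrix
approx = U @ np.diag(sigma) @ Vt

# Frobenius norm of the part not captured by the approximation
residual = np.linalg.norm(demo_matrix - approx)
```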

In [None]:
# N: number of users (rows)
N = len(matrix)
# M: number of jobs (columns)
M = len(matrix[0])
# K: number of latent features
K = 5

# Random initial user (P) and job (Q) feature matrices
P = np.random.rand(N,K)
Q = np.random.rand(M,K)

In [None]:
def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
    '''
    Factorize R into P and Q by gradient descent on the observed (nonzero) entries.

    R: user-job interaction matrix (N x M)
    P: N x K user feature matrix
    Q: M x K job (item) feature matrix
    K: number of latent features
    steps: maximum number of iterations
    alpha: learning rate
    beta: regularization parameter
    '''
    Q = Q.T

    for step in range(steps):
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    # Prediction error for the observed entry (i, j)
                    eij = R[i][j] - np.dot(P[i, :], Q[:, j])

                    for k in range(K):
                        # Gradient step with learning rate alpha and
                        # regularization beta; cache P[i][k] so that the
                        # update of Q uses the value from before this step
                        p_ik = P[i][k]
                        P[i][k] = p_ik + alpha * (2 * eij * Q[k][j] - beta * p_ik)
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * p_ik - beta * Q[k][j])

        # Regularized squared error over the observed entries
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - np.dot(P[i, :], Q[:, j]), 2)
                    for k in range(K):
                        e = e + (beta / 2) * (pow(P[i][k], 2) + pow(Q[k][j], 2))

        # Stop early once the error falls below the tolerance
        if e < 0.001:
            break

    return P, Q.T

In [None]:
nP, nQ = matrix_factorization(matrix, P, Q, K)

nR = np.dot(nP, nQ.T)

In [None]:
nR

array([[1.3168755 , 1.04247154, 1.02764828, ..., 1.08430754, 1.43459772,
        1.16089616],
       [0.95586457, 0.83638805, 0.77988591, ..., 0.84267134, 1.20541692,
        0.8975706 ],
       [1.55252824, 1.03693243, 0.76167209, ..., 0.75516726, 1.29030758,
        1.30052765],
       ...,
       [1.20942722, 1.35651666, 1.02600772, ..., 0.92647292, 1.10511285,
        1.12119506],
       [1.64520021, 1.78917799, 1.00692559, ..., 0.78229036, 0.94089175,
        1.31520647],
       [1.01154254, 0.99456085, 0.95110852, ..., 0.92809269, 1.12267931,
        0.99723973]])
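A score matrix like `nR` is typically turned into recommendations by ranking, for each user, the jobs they have not already applied to. The sketch below shows only the mechanics on made-up numbers; the function name `top_n_unseen` is hypothetical. Because the factorization above is fit only on entries where `R[i][j] > 0`, and every such entry equals 1, the scores it produces may carry little ranking signal.

```python
import numpy as np


def top_n_unseen(scores, interactions, user_idx, n=3):
    """Return column indices of the n highest-scoring jobs the user has not applied to."""
    user_scores = np.asarray(scores[user_idx], dtype=float).copy()
    # Exclude jobs the user already applied to
    user_scores[np.asarray(interactions[user_idx]) > 0] = -np.inf
    # Sort by descending score; a stable sort keeps ties in column order
    ranked = np.argsort(-user_scores, kind="stable")
    unseen = ranked[np.isfinite(user_scores[ranked])]
    return unseen[:n].tolist()


# Made-up scores and interactions for two users and four jobs
demo_scores = np.array([[0.9, 0.2, 0.8, 0.5],
                        [0.1, 0.7, 0.3, 0.6]])
demo_interactions = np.array([[1, 0, 0, 0],
                              [0, 1, 0, 1]])

recs_user0 = top_n_unseen(demo_scores, demo_interactions, user_idx=0, n=2)
recs_user1 = top_n_unseen(demo_scores, demo_interactions, user_idx=1, n=2)
```

The returned values are column positions; in the notebook they would still need to be mapped back to the `JobID_...` column labels of `df`.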

In [35]:
from sklearn.decomposition import NMF
model = NMF(n_components = 19, init='random', random_state=0)
W = model.fit_transform(matrix)
H = model.components_

In [28]:
matrix.shape

(19, 99)

In [36]:
H.shape

(19, 99)

In [33]:
matrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 1]], dtype=uint8)

In [32]:
H

array([[0.04628651, 0.03671022, 0.03829654, ..., 0.        , 0.        ,
        0.03107992],
       [0.01184851, 0.04441251, 0.03212933, ..., 0.        , 0.        ,
        0.02116656],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [37]:
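# Note: this compares H (the components matrix) element-wise with matrix, which
# is only possible because n_components equals the number of users (both 19 x 99).
# It is not the reconstruction error, which would compare W @ H with matrix.
dists = np.abs(H - matrix)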

In [38]:
np.linalg.norm(dists)

16.878763449852247

In [39]:
model = NMF(n_components = 19, init='nndsvd', random_state=0)
W2 = model.fit_transform(matrix)
H2 = model.components_

In [42]:
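# Same element-wise comparison of components against matrix, now for the
# NNDSVD-initialized model; this is not the reconstruction error of W2 @ H2
dists2 = np.abs(H2 - matrix)
np.linalg.norm(dists2)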

11.621242654848459
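The norms above compare the components matrix directly with `matrix`. A more conventional measure of fit for NMF is the Frobenius norm of `matrix - W @ H`. A minimal sketch with small hand-made factors follows; the function name `reconstruction_error` and the demo arrays are hypothetical.

```python
import numpy as np


def reconstruction_error(X, W, H):
    """Frobenius norm of the difference between X and its factorization W @ H."""
    return float(np.linalg.norm(X - W @ H))


# Hand-made non-negative factors (hypothetical): 3 users, 2 components, 3 jobs
W_demo = np.array([[1.0, 0.0],
                   [0.0, 2.0],
                   [1.0, 1.0]])
H_demo = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])

# A matrix that the factors reproduce exactly, and one shifted by 1 everywhere
X_exact = W_demo @ H_demo
X_shifted = X_exact + 1.0

err_exact = reconstruction_error(X_exact, W_demo, H_demo)
err_shifted = reconstruction_error(X_shifted, W_demo, H_demo)
```

With the notebook's variables, the calls would be `reconstruction_error(matrix, W, H)` and `reconstruction_error(matrix, W2, H2)`. scikit-learn also exposes a comparable value on the fitted model as `model.reconstruction_err_`.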

In [43]:
H2

array([[0.37800619, 0.37800619, 0.37800619, ..., 0.        , 0.        ,
        0.37800619],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.98932751, ..., 0.        , 0.        ,
        0.        ]])