# This is a working model using cosine similarities to recommend repositories
* Using the cosine similarity between a user's stated preference and each repository, we can recommend repositories to users who prefer a specific language and topic.
* For instance, if a user prefers Python with computer vision, we can recommend repositories that are written in Python and related to computer vision.

**To implement:** 

* It would be great to add more user info, such as preferred sub-topics using the `topics` column.

### TF-IDF Vectorization
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects the importance of a term in a document within a collection or corpus.

The TF-IDF value of a term is calculated by multiplying two factors: term frequency (TF) and inverse document frequency (IDF).

1. Term Frequency (TF): Term frequency measures the frequency of a term within a document. It is calculated by dividing the number of occurrences of a term in a document by the total number of terms in that document. 

- TF(t, d) = (Number of occurrences of term t in document d) / (Total number of terms in document d)

2. Inverse Document Frequency (IDF): Inverse document frequency measures the rarity of a term across the entire corpus. It is calculated by taking the logarithm of the total number of documents divided by the number of documents containing the term.

- IDF(t) = log((Total number of documents) / (Number of documents containing term t))

Once we have the TF and IDF values for each term in a document, we can calculate the TF-IDF value by multiplying them together:

TF-IDF(t, d) = TF(t, d) * IDF(t)

In our project, we are using TF-IDF to vectorize textual data. Vectorization is the process of converting text into numerical vectors that machine learning algorithms can understand. By representing documents as TF-IDF vectors, we can capture the importance of different terms within each document and compare documents based on their content.
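The formulas and the vectorization step above can be sketched in plain Python. This is a minimal illustration, not the code used by `codecompasslib`: the function name `tfidf_vectors`, the whitespace tokenizer, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with TF = count / len(doc), IDF = log(N / df).

    Hypothetical helper for illustration; tokenization is a plain whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for an empty document
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Toy corpus (made up for this example).
docs = [
    "python computer vision",
    "python web framework",
    "rust web server",
]
vocabulary, vectors = tfidf_vectors(docs)
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "python" appears in two of the three documents, so it receives a lower weight than "computer" or "vision", which appear in only one. Libraries such as scikit-learn use smoothed IDF and normalize the vectors by default, so their values will differ from this textbook version.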

### Cosine Similarity
Cosine similarity measures how similar two vectors are in a multi-dimensional space by taking the cosine of the angle between them. In general, the value ranges from -1 to 1: 1 means the vectors point in the same direction, 0 means they are orthogonal (in our case, they share no terms), and -1 means they point in opposite directions. Because TF-IDF weights are never negative, the scores in this project fall between 0 and 1. To calculate the cosine similarity, compute the dot product of the two vectors and divide it by the product of their magnitudes.
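The dot-product-over-magnitudes calculation can be written in a few lines. The sketch below uses only the standard library; the function name `cosine_similarity` and the choice to return 0 for a zero vector are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` score (up to floating-point rounding) 1 even though their magnitudes differ: cosine similarity depends only on direction, which is why a short description and a long one can still match well.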

In this project, we represent the user's textual preference and each repository as vectors. Each vector captures the frequency or presence of certain words or features in the text. By calculating the cosine similarity between the user's preference vector and each repository vector, we can determine how similar they are. We then rank the repositories by similarity score and recommend the top few.
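The score-then-rank flow can be sketched end to end on toy data. This is not the implementation of `recommend_repos`: the helper names (`to_vector`, `cosine`, `rank_repos`), the repository names, and the use of raw word counts in place of TF-IDF weights are all assumptions made to keep the example short.

```python
import math
from collections import Counter

def to_vector(text):
    """Sparse bag-of-words vector: term -> count (a stand-in for TF-IDF weights)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_repos(preference, repos, top_n=3):
    """Return the top_n (name, score) pairs, highest similarity first."""
    query = to_vector(preference)
    scored = [(name, cosine(query, to_vector(text))) for name, text in repos.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical repositories: name -> text built from description and language.
repos = {
    "cv-toolkit": "python computer vision toolkit",
    "web-api": "python web api",
    "rust-server": "rust http server",
}
ranking = rank_repos("python with computer vision", repos, top_n=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Here `cv-toolkit` shares three of the four query words and ranks first, while `rust-server` shares none and is cut by `top_n=2`. With raw counts, a common word like "python" counts as much as "vision"; TF-IDF weighting reduces that effect.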

In [2]:
import sys
import pandas as pd
from pandas import DataFrame
# go up three directories so that codecompasslib can be imported
sys.path.append('../../../')

from codecompasslib.models.cosine_similarity_model import load_data, clean_data, recommend_repos

In [3]:
# load the data
df = load_data('1Qiy9u03hUthqaoBDr4VQqhKwtLJ2O3Yd')




In [8]:
# turn df into DataFrame type and clean it
df = pd.DataFrame(df)
df_cleaned = clean_data(df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['description'] = df['description'].fillna('')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['name'] = df['name'].fillna('')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['language'] = df['language'].fillna('')


In [6]:
# User preference (textual input)
user_preference = "python with computer vision"

In [7]:
recommended_repos = recommend_repos(user_preference, df_cleaned, top_n=10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['cosine_similarity_score'] = cosine_scores


In [13]:
recommended_repos

| | name | description | language | cosine_similarity_score |
|---|---|---|---|---|
| 0 | computer-vision | Computer Vision Applications | Python | 0.895181 |
| 1 | Computer-Vision | Computer Vision | | 0.891907 |
| 2 | Computer-Vision-with-Python-3 | Computer Vision with Python 3, published by Packt | Python | 0.889605 |
| 3 | Computer-Vision-with-Python-3 | Computer Vision with Python 3, published by Packt | | 0.888058 |
| 4 | Computer-Vision-with-Python-3 | Computer Vision with Python 3, published by Packt | | 0.888058 |
| 5 | Computer-Vision-with-Python-3 | Computer Vision with Python 3, published by Packt | | 0.888058 |
| 6 | computer-vision | No description | Python | 0.884376 |
| 7 | computer-vision | No description | Python | 0.884376 |
| 8 | Python-3.x-for-Computer-Vision | No description | | 0.863562 |
| 9 | Computer-vision | Computer Vision implementation | | 0.860605 |
