# DL836 - Creative Computing - Year 4 - Artificial Intelligence Assignment 01
### Caolán O'Hagan - N00181501

## 1. Introduction

Recommender systems are programs that use machine learning algorithms and techniques to take in data, process it, and return items that are similar to the input based on its characteristics. They filter, prioritize and deliver relevant information, which helps combat information overload: the excess of information a person can be exposed to, especially on the internet. For this reason, recommender systems have become widespread in online retail and eCommerce and in streaming services such as Netflix, Amazon, Hulu, and YouTube. Recommender systems and algorithms are also present in social media, where they populate news feeds and explore pages with new and relevant content.

## 2. Applications of AI

### 2.1 Robotics

The best example of artificial intelligence and robotics integration is self-driving vehicles. The first successful display of autonomous vehicles was the 132-mile DARPA Grand Challenge in 2005, in which an autonomous vehicle drove on dirt roads. Waymo test vehicles achieved 10 million miles on public roads without any accidents in 2018. These test vehicles used supervised artificial intelligence, with a human driver assuming control every 6,000 miles. Tesla is the leading manufacturer of electric cars that utilize consumer-ready artificial intelligence self-driving software in their vehicles. The advantages of using artificial intelligence in vehicles are reduced accidents, higher traffic efficiency, and access to transport for physically impaired people (“Advantages and disadvantages of autonomous vehicles,” 2021). The main limitation of this software, and of implementing artificial intelligence in cars in general, is the high cost of implementation and production (“Advantages and disadvantages of autonomous vehicles,” 2021). Artificial intelligence drives up the price of vehicles and can alienate a portion of a company's consumer base. Companies will need to create software that is easily scalable and updatable over time, which is new for the car industry. Artificial intelligence programs also require a substantial amount of computational performance (“Advantages and disadvantages of autonomous vehicles,” 2021). This can be an issue because the computer portion of the vehicle must share power, space, and other resources with the mechanical components of the vehicle.

### 2.2 Finance

Artificial intelligence is used within the finance industry for risk assessment and management, fraud detection, credit decisions, and trading. Banks are utilizing machine learning to assess whether applicants are suitable candidates for loans. Artificial intelligence is used to analyse a person’s buying behaviour and flag possible fraudulent behaviour (“Artificial Intelligence in Finance [15 Examples],” 2021). It also analyses large trading datasets to predict market trends and behaviours. The main limitations of artificial intelligence in finance revolve around privacy, cost, and misuse of data. When it comes to trading, artificial intelligence cannot beat human intuition or common sense. Outside of customer care chatbots, a lot of AI software used in finance is supervised.

### 2.3 Medicine

Artificial intelligence is beginning to be used in the healthcare sector in numerous ways. Much like artificial intelligence in cars, it is used to support diagnosis and reduce human error. It is used in cancer diagnosis, general illness diagnosis, intelligent symptom checkers, and developing new medicines. Companies now specialize in developing new drugs using artificial intelligence. Berg Health uses artificial intelligence to scan vast amounts of data and map diseases to accelerate the development of medicines (“37 Examples of AI in Healthcare That Will Make You Feel Better About the Future,” 2014). Bioxcel Therapeutics uses artificial intelligence to find new applications for existing drugs (“37 Examples of AI in Healthcare That Will Make You Feel Better About the Future,” 2014).


## 3. Recommender Systems

Recommender systems can be split into three main types: demographic/popularity-based, content-based, and collaborative systems.

### 3.1 Content-Based Approach

>**2.Content Based Filtering** - Suggests similar items based on a particular item (ibtesama, 2020). For example, content-based filtering would use item metadata and characteristics of a movie database such as genre, descriptions, and actors. It uses the features of the movie to make these recommendations. The general idea behind these recommender systems is that if a person likes a particular item, they will also want an item with similar characteristics and metadata.

The content-based approach is based on a semantic search, which is primarily used for information retrieval (ibtesama, 2020). Suppose a user liked a particular film by a specific director or a book by a particular author. In that case, the content-based approach will analyse the liked item and return movies or books by the same director or author, or with similar plot descriptions or blurbs. Content-based filtering bases its results solely on the user's own input. The model doesn't need any data about other users, since the recommendations are specific to this user. This makes it easier to scale to a large number of users (“Content-based Filtering Advantages & Disadvantages,” 2018).

The quality of content-based filtering also depends on the quality of the data and its attributes. Some data, for example, books and book reviews, work very well with a content-based approach. Content-based filtering can draw on a book's author, blurb, description, themes, and publisher to find similarities and present recommendations from them. If the dataset lacks sufficient attributes, or creating the characteristics of an item is difficult, then the content-based approach will not perform well. The content-based approach does not take in interaction data; it only uses the static data that describes each item. The model therefore has limited ability to expand on the user's existing interests (“Content-based Filtering Advantages & Disadvantages,” 2018).

### 3.2 Collaborative Approach

>**2.Collaborative Filtering** - Collaborative filtering uses similarities between users and items simultaneously to provide recommendations.

Collaborative filtering only needs a user's historical preferences on a set of items. It focuses on the relationship between user and item. The system uses information from data collection and other user interactions ("Content-based Filtering Advantages & Disadvantages," 2018). It recommends items based on how similarly those items were rated by users who have rated both. Collaborative filtering follows the logic that if user A likes the same item as user B, then user A is likely to like another item picked by user B ("A Simple Introduction to Collaborative Filtering," 2019). Unlike content-based filtering, collaborative filtering can help users discover new things.

Collaborative filtering can be split into two types, user-based and item-based. User-based collaborative filtering recommends items to a user that similar users have liked. The similarity between users is measured using a Pearson correlation or cosine similarity score (ibtesama, 2020). Item-based collaborative filtering compares the similarity of items the user has already rated. A Pearson correlation or cosine similarity is used to compute this similarity too (ibtesama, 2020).
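
To make the user-based variant concrete, the following is a minimal sketch in plain Python. The ratings, the user names, and the helper functions (`cosine_similarity`, `pearson_similarity`, `most_similar_user`, `recommend`) are all hypothetical and made up for illustration. They are not part of the recommender built in this report, and a real system would usually combine several neighbours rather than rely on a single one.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}
ratings = {
    "alice": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob":   {"book_a": 4, "book_b": 2, "book_c": 5, "book_d": 5},
    "carol": {"book_a": 1, "book_b": 5, "book_d": 2},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def pearson_similarity(u, v):
    """Pearson correlation over the items both users have rated."""
    shared = set(u) & set(v)
    if len(shared) < 2:
        return 0.0
    mean_u = sum(u[i] for i in shared) / len(shared)
    mean_v = sum(v[i] for i in shared) / len(shared)
    cov = sum((u[i] - mean_u) * (v[i] - mean_v) for i in shared)
    var_u = sum((u[i] - mean_u) ** 2 for i in shared)
    var_v = sum((v[i] - mean_v) ** 2 for i in shared)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

def most_similar_user(target, ratings, similarity=pearson_similarity):
    """Return the other user whose ratings are closest to the target's."""
    others = [name for name in ratings if name != target]
    return max(others, key=lambda name: similarity(ratings[target], ratings[name]))

def recommend(target, ratings, similarity=pearson_similarity):
    """Suggest items the nearest neighbour rated that the target has not seen."""
    neighbour = most_similar_user(target, ratings, similarity)
    return sorted(item for item in ratings[neighbour] if item not in ratings[target])

print(recommend("alice", ratings))
```

In this toy data both measures pick the same neighbour for "alice", so the suggestion is the one item that neighbour has rated and "alice" has not.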


## 4. Programming Book Recommender System

The recommender system created for this report recommends programming books. The dataset used is the Top 270 Computer Science / Programing Books. This dataset stores:

- Book Title
- Description
- Number of pages
- Type (Hardback, paperback etc)
- Price
- Reviews
- Ratings

>**[The source for this dataset can be found here](https://www.kaggle.com/thomaskonstantin/top-270-rated-computer-science-programing-books)**


### 4.1 Description

This recommender system will compute pairwise similarity scores for all the programming books based on their descriptions. The system will then recommend books based on this similarity score. 

### 4.2 Implementation

- The first step of building this recommender system is to import the Python packages and the dataset that the system will be working with.
- Pandas is an open-source software library used for data manipulation and analysis.
- NumPy is a scientific computing library for manipulating large data sets.
- Scikit-learn is a machine learning library. This recommender system specifically uses the TfidfVectorizer from it.
- The dataset is read by passing the file name to the pandas function read_csv(). The result is assigned to the variable books, which is then displayed to show the data.


In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer 

books = pd.read_csv('prog_book.csv')
books

| | Rating | Reviews | Book_title | Description | Number_Of_Pages | Type | Price |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 4.17 | 3829 | The Elements of Style | This style manual offers practical advice on i... | 105 | Hardcover | 9.323529 |
| 1 | 4.01 | 1406 | The Information: A History, a Theory, a Flood | James Gleick, the author of the best sellers C... | 527 | Hardcover | 11.000000 |
| 2 | 3.33 | 0 | Responsive Web Design Overview For Beginners | In Responsive Web Design Overview For Beginner... | 50 | Kindle Edition | 11.267647 |
| 3 | 3.97 | 1658 | Ghost in the Wires: My Adventures as the World... | If they were a hall of fame or shame for compu... | 393 | Hardcover | 12.873529 |
| 4 | 4.06 | 1325 | How Google Works | Both Eric Schmidt and Jonathan Rosenberg came ... | 305 | Kindle Edition | 13.164706 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 266 | 3.76 | 0 | 3D Game Engine Architecture: Engineering Real-... | Dave Eberly's 3D Game Engine Design was the fi... | 752 | Hardcover | 203.108823 |
| 267 | 3.94 | 22 | An Introduction to Database Systems | Continuing in the eighth edition, An Introduct... | 1040 | Paperback | 212.097059 |
| 268 | 4.49 | 36 | The Art of Computer Programming, Volumes 1-3 B... | Knuth's classic work has been widely acclaimed... | 896 | Boxed Set - Hardcover | 220.385294 |
| 269 | 4.77 | 4 | The Art of Computer Programming, Volumes 1-4a ... | "The bible of all fundamental algorithms and t... | 3168 | Hardcover | 220.385294 |


- The Description column of books is printed out to look at the descriptions this recommender will be using.

In [2]:
books['Description']

0      This style manual offers practical advice on i...
1      James Gleick, the author of the best sellers C...
2      In Responsive Web Design Overview For Beginner...
3      If they were a hall of fame or shame for compu...
4      Both Eric Schmidt and Jonathan Rosenberg came ...
                             ...                        
266    Dave Eberly's 3D Game Engine Design was the fi...
267    Continuing in the eighth edition, An Introduct...
268    Knuth's classic work has been widely acclaimed...
269    "The bible of all fundamental algorithms and t...
270    Designed to help individual programmers develo...
Name: Description, Length: 271, dtype: object

#### Text Processing

- With the data and necessary packages successfully loaded, it is time to process the text. This recommender will measure the term frequency (TF) of the descriptions. Term frequency is how often a word appears in a document, divided by how many words there are in the document as a whole.

- Inverse Document Frequency (IDF) measures how rare a word is across the whole collection of documents. A word that appears in many of the descriptions gets a low IDF score, while a word that appears in only a few gets a high one.

- The TF score is multiplied by the IDF score to give the TF-IDF score. The TF-IDF score reflects how relevant a word is to a document. This score can then be used to draw similarities between the books.

- The TfidfVectorizer imported from the Scikit-learn library is set to remove all the English stop words. These are words like 'a' and 'the'. The TF-IDF object is then assigned to the variable tfidf.

- Any NaN values are replaced with empty strings, as they are useless and could cause an error.

- The TF-IDF matrix is created by passing the descriptions from books to the fit_transform method. The matrix is assigned to the variable tfidf_matrix. A matrix is a two-dimensional array, which here holds the TF-IDF scores of the words from the descriptions. Each column in the matrix represents a word in the descriptions' vocabulary. Each row in the matrix represents a programming book.

- This is done because the TF-IDF weighting dampens the effect of high-frequency words in the descriptions, which could otherwise affect the similarity scores further down.

- The shape of the matrix is then printed out. There are 3,019 distinct words (after stop-word removal) used in the descriptions of the 271 programming books.

In [3]:
# Define a TF-IDF vectorizer object and remove all English stop words
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN values with empty strings
books['Description'] = books['Description'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(books['Description'])

# Output the shape
tfidf_matrix.shape

(271, 3019)
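
The TF and IDF definitions above can be checked by hand on a tiny corpus. The sketch below uses the plain textbook formulas (term count divided by document length, multiplied by log(N / document frequency)) on three made-up descriptions. The function name `tf_idf` and the sample texts are assumptions for illustration only. Scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its exact numbers will differ from these, although the ranking idea is the same.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term occurs in
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Made-up descriptions, not taken from the dataset
docs = [
    "python programming guide",
    "java programming guide",
    "database systems guide",
]

weights = tf_idf(docs)
for term, score in sorted(weights[0].items()):
    print(f"{term}: {score:.3f}")
```

A word that appears in every description ("guide") scores zero, a word shared by two descriptions ("programming") scores a little, and a word unique to one description ("python") scores the most. This is the behaviour that lets TF-IDF favour distinctive words when comparing books.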

- With the TF-IDF matrix built, the next step is to calculate the similarity scores.
- There are a couple of different options for calculating the similarity scores. These include the Pearson correlation, the cosine similarity, and the Euclidean distance. No one measure is better than the others in every case. It depends on the situation.
- This recommender system uses the cosine similarity score.
- The cosine similarity measures how similar documents are irrespective of their size. It does this by measuring the cosine of the angle between two vectors projected in a multi-dimensional space.
- We have a TF-IDF matrix in which the description of each programming book has been transformed into a feature vector by the TfidfVectorizer that was imported from the Scikit-learn library.
- By default, the TfidfVectorizer normalizes each row vector to unit length, so the recommender system only needs to calculate the dot product of two rows to get their cosine similarity score.
- To calculate this, linear_kernel is imported from the Scikit-learn library. The linear kernel is simply the dot product between each pair of row vectors, so applying it to the normalized TF-IDF matrix gives the cosine similarity scores directly.
- The linear_kernel function is passed the tfidf_matrix twice, and the result is assigned to the variable cosine_sim.
- The cosine_sim matrix is then turned into a dataframe and assigned to the variable cosine_sim_df so the scores can be viewed.



In [4]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim_df = pd.DataFrame(cosine_sim)

cosine_sim_df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007516 | ... | 0.000000 | 0.010810 | 0.000000 | 0.023133 | 0.020086 | 0.008974 | 0.000000 | 0.000000 | 0.000000 | 0.057008 |
| 1 | 0.000000 | 1.000000 | 0.008134 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.013308 | 0.000000 | 0.000000 | ... | 0.016330 | 0.000000 | 0.012456 | 0.028450 | 0.000000 | 0.011599 | 0.000000 | 0.028423 | 0.033454 | 0.000000 |
| 2 | 0.000000 | 0.008134 | 1.000000 | 0.000000 | 0.006023 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.012671 | ... | 0.000000 | 0.060804 | 0.008352 | 0.009111 | 0.000000 | 0.025238 | 0.000000 | 0.000000 | 0.000000 | 0.022181 |
| 3 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007620 | 0.000000 | 0.022561 | ... | 0.000000 | 0.000000 | 0.028420 | 0.000000 | 0.007389 | 0.000000 | 0.000000 | 0.022857 | 0.008430 | 0.000000 |
| 4 | 0.000000 | 0.000000 | 0.006023 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006577 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015704 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266 | 0.008974 | 0.011599 | 0.025238 | 0.000000 | 0.000000 | 0.010956 | 0.000000 | 0.024163 | 0.006416 | 0.032506 | ... | 0.038077 | 0.034431 | 0.161763 | 0.121087 | 0.036881 | 1.000000 | 0.022102 | 0.047900 | 0.026906 | 0.054047 |
| 267 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.036009 | 0.000000 | 0.000000 | 0.007033 | 0.000000 | ... | 0.000000 | 0.110610 | 0.007374 | 0.008045 | 0.063980 | 0.022102 | 1.000000 | 0.041171 | 0.000000 | 0.019356 |
| 268 | 0.000000 | 0.028423 | 0.000000 | 0.022857 | 0.000000 | 0.018840 | 0.028311 | 0.076058 | 0.000000 | 0.042404 | ... | 0.081304 | 0.022621 | 0.020226 | 0.036969 | 0.081284 | 0.047900 | 0.041171 | 1.000000 | 0.101464 | 0.000000 |
| 269 | 0.000000 | 0.033454 | 0.000000 | 0.008430 | 0.015704 | 0.000000 | 0.000000 | 0.025529 | 0.000000 | 0.000000 | ... | 0.059353 | 0.000000 | 0.075426 | 0.028917 | 0.034910 | 0.026906 | 0.000000 | 0.101464 | 1.000000 | 0.052103 |
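
The reason a plain dot product is enough here can be checked with a few lines of arithmetic. The sketch below, in plain Python with made-up vectors, computes the cosine similarity from its definition and then shows that the dot product of the same two vectors after scaling each to unit length gives the same number. The helper names `dot`, `cosine`, and `l2_normalise` are hypothetical and are not Scikit-learn functions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between a and b, from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalise(v):
    """Scale v so that its length (L2 norm) is 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Two made-up feature vectors of different lengths
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

by_definition = cosine(a, b)
by_dot_product = dot(l2_normalise(a), l2_normalise(b))

print(by_definition, by_dot_product)
```

Both values equal 11/15 up to floating-point rounding. Scaling a vector does not change its direction, which is also why the score is independent of description length: `cosine(a, [2 * x for x in a])` is 1 (up to rounding).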


- The recommender system needs to be able to identify the index of a book given its title.
- To do this, the recommender builds a reverse map from the book titles to the book indices.

In [5]:
# Construct a reverse map of indices and book titles

indices = pd.Series(books.index, index = books['Book_title']).drop_duplicates()

indices_data_frame = pd.DataFrame(indices)

indices_data_frame

| Book_title | 0 |
| :--- | :--- |
| The Elements of Style | 0 |
| The Information: A History, a Theory, a Flood | 1 |
| Responsive Web Design Overview For Beginners | 2 |
| Ghost in the Wires: My Adventures as the World's Most Wanted Hacker | 3 |
| How Google Works | 4 |
| ... | ... |
| 3D Game Engine Architecture: Engineering Real-Time Applications with Wild Magic (The Morgan Kaufmann Series in Interactive 3d Technology) | 266 |
| An Introduction to Database Systems | 267 |
| The Art of Computer Programming, Volumes 1-3 Boxed Set | 268 |
| The Art of Computer Programming, Volumes 1-4a Boxed Set | 269 |


- The function recommendations_cosine is defined.
- This function takes in a book title and the matrix of cosine similarity scores.
- It finds the index of the book using its title and assigns the result to book_idx.
- It takes the row of cosine_sim at book_idx, uses enumerate to pair each score with its book index, and stores the result as a list in the variable sim_book_scores.
- The list is then sorted by similarity score, from highest to lowest.
- The ten most similar books are kept in sim_book_scores. The first entry is skipped because it is the book itself, which has a similarity score of 1.
- The book indices are assigned to the variable books_indices by looping through sim_book_scores.
- The function returns the titles of the ten similar books to be printed out.

In [6]:
def recommendations_cosine(book_title, cosine_sim = cosine_sim):
    
    # Get the index of the book that matches the title
    book_idx = indices[book_title]
    
    # Get the pairwise similarity score of all the books with that book
    sim_book_scores = list(enumerate(cosine_sim[book_idx]))
    
    # Sort the books based on the similarity scores
    sim_book_scores = sorted(sim_book_scores, key=lambda x: x[1], reverse = True)
    
    # Get the list of the ten most similar books (skipping the book itself)
    sim_book_scores = sim_book_scores[1:11]
    
    # Get the books indices
    books_indices = [i[0] for i in sim_book_scores]
    
    # Return ten similar books
    return books['Book_title'].iloc[books_indices]

In [8]:
recommendations_cosine('Introduction to Algorithms')

224                                     Algorithm Design
96     Effective Python: 59 Specific Ways to Write Be...
216                                           Algorithms
232                             Database System Concepts
222                                     Machine Learning
243    The Art of Computer Programming, Volume 1: Fun...
125                                          C# in Depth
238                              Game Programming Gems 2
163                       Data Structures and Algorithms
269    The Art of Computer Programming, Volumes 1-4a ...
Name: Book_title, dtype: object
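
The selection step in recommendations_cosine relies on the queried book sorting first, so that the slice [1:11] drops it. As a hedged alternative, the sketch below excludes the queried book by its index instead, which also behaves sensibly if another book happens to have an identical description and ties at a score of 1. The function name `top_n_similar` and the sample scores are hypothetical. This is an illustration in plain Python, not the function used in this report.

```python
def top_n_similar(similarity_row, query_idx, n=10):
    """Return the indices of the n highest-scoring items, excluding the query."""
    candidates = [
        (idx, score)
        for idx, score in enumerate(similarity_row)
        if idx != query_idx
    ]
    # Highest similarity first; ties keep their original index order
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in candidates[:n]]

# Made-up similarity row: item 1 is the query, item 4 is an exact duplicate of it
row = [0.10, 1.00, 0.00, 0.45, 1.00]

print(top_n_similar(row, query_idx=1, n=3))
```

With this toy row the result is [4, 3, 0]: the duplicate is kept as a recommendation and only the queried item itself is left out.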

| References |
| :--- |
| Content-based Filtering Advantages & Disadvantages. (2018). Retrieved October 30, 2021, from Google Developers website: https://developers.google.com/machine-learning/recommendation/content-based/summary |
| ibtesama. (2020, May 16). Getting Started with a Movie Recommendation System. Retrieved October 30, 2021, from Kaggle.com website: https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system |
| A Simple Introduction to Collaborative Filtering. (2019). Retrieved October 30, 2021, from Built In website: https://builtin.com/data-science/collaborative-filtering-recommender-system |
| Advantages and disadvantages of autonomous vehicles. (2021, August 30). Retrieved October 30, 2021, from BBVA.CH website: https://www.bbva.ch/en/news/advantages-and-disadvantages-of-autonomous-vehicles/ |
| 37 Examples of AI in Healthcare That Will Make You Feel Better About the Future. (2014). Retrieved October 30, 2021, from Built In website: https://builtin.com/artificial-intelligence/artificial-intelligence-healthcare |
| Artificial Intelligence in Finance [15 Examples]. (2021, April 2). Retrieved October 30, 2021, from University of San Diego website: https://onlinedegrees.sandiego.edu/artificial-intelligence-finance/ |