# DCS World Campaign Recommender v0.1 
- Donghang Wu ©2022

### Project Goals
The goal of this project is to help DCS users browse the ever-growing list of campaign DLCs and to recommend
appropriate campaigns for their next purchase.

The initial version of the recommender is based on an NLP implementation that uses only the product info provided on the
DCS World website (the same info a user sees), with collaborative filtering to be added later when data becomes available.

This project first extracts product info from [DCS-WORLD.com](https://www.digitalcombatsimulator.com) and
saves it as a csv file, ready for general analysis and use in other products.

To use the data with `dcs_recommender`, one first needs to remove unnecessary text and concatenate the remaining text into a `'tags'` column; `dcs_recommender` requires a **clean** csv with properly aggregated tags.
 

Here are a few questions that this project has sought to answer:
- Can the recommender accurately recommend topic-related or author-related campaigns (content based)?
- How do those results compare to a *veteran player recommendation*?
- How can the recommender take in other domain knowledge that *veteran players* are aware of (such as gameplay quality, types of missions included, mission length, etc.)?
- Will `dcs_recommender` perform better in sales than the current *popular* recommendation presented on the DCS website?

### Data sources

The base product info was provided by [DCS-WORLD.com](https://www.digitalcombatsimulator.com).

*future purchase related data may be added from Eagle Dynamics*


### Evaluation

For v0.1, the simple NLP implementation, `dcs_recommender` is able to recommend campaigns that are strongly related in actual content, such as *training campaigns*, *Red Flag campaigns*, and *campaigns by the same author* (author names are not included in the tags, so this is actually amazing!)

However, there are some significant downsides. `dcs_recommender` may recommend campaigns released years ago, whose build and production quality are simply not comparable to recent ones. Old campaigns may also be buggy and poorly maintained, giving players a poor experience.

To alleviate this issue, I will add weightings based on campaign publishing date, Steam user reviews and other publicly available data. The results should then rank older and more negatively reviewed campaigns lower, thus completing the 'content filtering' part of the recommender.
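
A minimal sketch of how such a weighting might look, under stated assumptions: the inputs `release_year` and `positive_ratio`, the four-year half-life, and the multiplicative blend are all hypothetical choices for illustration, and none of them are fields or settings from the scraped data set.

```python
def reweight(similarity, release_year, positive_ratio,
             current_year=2022, half_life=4.0):
    """Scale a raw content-similarity score by campaign age and review quality.

    Everything except `similarity` is a hypothetical input (assumption).
    """
    age = max(current_year - release_year, 0)
    recency = 0.5 ** (age / half_life)   # weight halves every `half_life` years
    return similarity * recency * positive_ratio


# Made-up candidates: (name, similarity, release_year, positive_ratio)
candidates = [
    ("Old campaign", 0.80, 2014, 0.60),
    ("Recent campaign", 0.70, 2021, 0.90),
]

ranked = sorted(candidates,
                key=lambda c: reweight(c[1], c[2], c[3]),
                reverse=True)

for name, sim, year, ratio in ranked:
    print(f"{name}: {reweight(sim, year, ratio):.3f}")
```

With these made-up numbers the older campaign drops below the recent one even though its raw similarity is higher, which is the behaviour described above. A real weighting would need tuning against actual review data.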

I expect the model to perform fairly well once 'collaborative filtering' is added, and I will present it to DCS veteran player groups for evaluation before the official launch.


## Table of Contents

### - Step 1: Web Scraping and data gathering
- 1-1: [Scrape product links from main website](#1-1)
- 1-2: [Scrape product detailed info from product pages and create `dcs_campaign_data.csv`](#1-2)

### - Step 2: Constructing the dataframe and setting up data for model building
- 2-1: [Loading `dcs_campaign_data.csv` as `campaign_data` and some basic cleaning](#2-1)
    - [Dealing with missing values](#2-1_NaN)
- 2-2: [Setting up dataframe for vectorization/nltk](#2-2)
    

### - Step 3: Model building and testing
- 3-1 [Creating vectorizer and Vectorizing `'tags'` to 2D array](#3-1)
- 3-2 [Stemming `'tags'` to improve NLP accuracy](#3-2)
- 3-3 [Final Recommender function creation and testings](#3-3)

### - Step 4: Pipeline creation and reusability
- 4-1 [Creating the `dcs_recommender` class and related functions](#4-1)
- 4-2 [Testing `dcs_recommender` and its methods](#4-2)

### FUTURE TO DO LIST:

1. improve the model by adding module_requirements
1. add price, review, and other quantitative data and clustering models to reinforce the recommender (collaborative filtering)
1. account for older campaigns being less maintained by giving them less weight
1. add a more robust approximate-matching system, so that the user only needs to enter a partial campaign name
1. implement the recommender for other modules such as planes and maps
1. add a GUI for user accessibility
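
As a rough illustration of the partial-name matching item, the sketch below uses only the standard library. The helper name `find_campaign` and the `cutoff` value are assumptions made for this example; they are not part of `dcs_recommender`.

```python
import difflib


def find_campaign(query, names, cutoff=0.4):
    """Return the best-matching campaign name for a partial query, or None.

    Hypothetical helper: tries a case-insensitive substring match first,
    then falls back to difflib's similarity ratio.
    """
    lowered = {name.lower(): name for name in names}
    q = query.lower()

    # A plain substring hit wins; the shortest matching title is preferred.
    substring_hits = [orig for low, orig in lowered.items() if q in low]
    if substring_hits:
        return min(substring_hits, key=len)

    # Otherwise fall back to fuzzy matching on the lowercased titles.
    close = difflib.get_close_matches(q, list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None


names = [
    "MAD Campaign by Stone Sky",
    "MAD JF-17 Thunder Campaign by Stone Sky",
    "The Museum Relic Campaign by Apache600",
]

print(find_campaign("museum relic", names))
print(find_campaign("jf-17", names))
```

The returned title could then be passed to the recommender in place of the exact name the user would otherwise have to type.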

## Step 1-1: Parse product links from main website <a name="1-1"></a>

## Step 1-2 Parse product detailed info from product pages <a name="1-2"></a>

## Step 2-1 Loading dataframe and basic cleaning <a name="2-1"></a>

In [None]:
# data related package loading
import pandas as pd
import numpy as np

# model related package loading
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity


# some regex warnings come up when dealing with strings in df; they are ignored for presentation purposes
import time
import warnings
warnings.filterwarnings('ignore') # comment out this line to see warnings


### Loading data from `dcs_campaign_data.csv`

- We will **rename** the first column as `'campaign_id'`
- Since the `'name'` column has author names in it, we will `extract` them and store them in a column called `'author'`
- **compare** before and after info on `campaign_data` dataframe to confirm existence of `'author'`
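
To make the intended result concrete, here is a small standalone sketch of the same idea using plain `re` rather than pandas. The helper `split_title` and the exact pattern are assumptions for illustration; the default author follows the rule used in this notebook that a title without an author belongs to Eagle Dynamics.

```python
import re

# "by" as a whole word, followed by the author name up to the end of the title
AUTHOR_PATTERN = re.compile(r"\bby\s+(.*)$")


def split_title(raw_title):
    """Return (clean_name, author) for a raw product title (hypothetical helper)."""
    name = raw_title.replace("DCS:", "").strip()
    match = AUTHOR_PATTERN.search(name)
    # No "by <author>" part means the campaign is treated as an Eagle Dynamics one.
    author = match.group(1).strip() if match else "Eagle Dynamics"
    return name, author


print(split_title("DCS: MAD Campaign by Stone Sky"))
print(split_title("DCS: Su-27 The Ultimate Argument Campaign"))
```

The author stays in the cleaned name on purpose, matching the choice made in the cell below.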

In [None]:
campaign_data = pd.read_csv('dcs_campaign_data.csv')
campaign_data = campaign_data.rename(columns={campaign_data.columns[0]: 'campaign_id'})

print(campaign_data.info())

# Extract author to construct 'author' column (titles without "by <author>" become NaN)
campaign_data['author'] = campaign_data['name'].str.extract(r'\bby\s+(.*)', expand=False)

# Since author names are left out of the tags for now, we leave them in the titles to assess the effectiveness of the model
# Strip 'DCS:' from the titles (WARNING: only do ONCE)
campaign_data['name'] = campaign_data['name'].str.replace('DCS:', '', regex=False).str.strip()

# print(campaign_data.head(5))
print(campaign_data.info())

### Dealing with missing values <a name="2-1_NaN"></a>

- One product's `'description'` is NaN because the product is outdated, so the entire `row` is DROPPED
- `NaN` in `campaign_data['author']` means the campaign was made by Eagle Dynamics, so it is filled with `'Eagle Dynamics'`

In [None]:
# One description is NaN due to an outdated product, thus DROP
# author NaN means the campaign is made by Eagle Dynamics, thus fill 'Eagle Dynamics'
campaign_data['author'] = campaign_data['author'].fillna(value='Eagle Dynamics')
campaign_data = campaign_data.dropna()
campaign_data.info()

# Dataset is clean (some data points may be missing 'key_features', meaning the author is too lazy to advertise, no hard feelings)

## Step 2-2 Setting up dataframe for vectorization/nltk <a name="2-2"></a>

1. remove all *commas*, *colons*, *periods* and other non-word characters from `campaign_data`
1. break sentences into words and store them as *lists* in the *columns*, ready for **concatenation**
1. remove **spaces** within *phrases* and *names* for **CONCAT & NLP** purposes
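
The effect of these steps on a single record can be sketched without pandas. The helpers `build_tags` and `squash` and the sample strings are made up for illustration only; the actual notebook cells follow below.

```python
import re


def build_tags(*fields):
    """Join free-text fields into one lowercase, punctuation-free tag string."""
    words = []
    for field in fields:
        # Drop commas, periods, colons, brackets and parentheses.
        cleaned = re.sub(r"[,.:\[\]()]", "", field)
        words.extend(cleaned.split())
    return " ".join(words).lower()


def squash(phrase):
    """Remove inner spaces so a multi-word name counts as a single token."""
    return phrase.replace(" ", "")


print(build_tags("Fly 12 missions: day, night.", "Voice-overs (English)"))
print(squash("Stone Sky"))
```

Squashing names keeps "Stone Sky" from being counted as the two unrelated words "stone" and "sky" once the text is vectorized.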

In [None]:
# Strip all commas, periods and colons from the text columns
# (regex is set explicitly: as a regex, a bare '.' would match every character)
for col in ['description', 'key_features', 'voice_over_loc', 'subtitle_loc']:
    campaign_data[col] = campaign_data[col].str.replace(r'[,.:]', '', regex=True)

# Remove leftover brackets and parentheses from 'key_features'
campaign_data['key_features'] = campaign_data['key_features'].str.replace(r'[\[\]()]', '', regex=True)

# All splits needed prior to tag merge (each text column becomes a list of words)
for col in ['description', 'key_features', 'voice_over_loc', 'subtitle_loc']:
    campaign_data[col] = campaign_data[col].apply(lambda x: x.split())

# All space removing prior to tag merge
campaign_data['module_requirements'] = campaign_data['module_requirements'].str.replace(' ', '')
campaign_data['author'] = campaign_data['author'].str.replace(' ', '')

campaign_data.head(2)

### Concatenate to form `'tags'` column in new df `campaign_data_model`

- There are some issues with the `author` and `module_requirements` features, so they are **excluded** for now

In [None]:
# campaign_data['module_requirements'] and campaign_data['author'] have some datatype issues that still need to be solved

campaign_data['tags'] = campaign_data['description']+campaign_data['key_features']+campaign_data['voice_over_loc']+campaign_data['subtitle_loc']

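# .copy() makes campaign_data_model an independent dataframe, so later assignments do not hit a slice of campaign_data
campaign_data_model = campaign_data[['campaign_id','name','tags']].copy()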
campaign_data_model['tags'] = campaign_data_model['tags'].apply(lambda x:' '.join(x))

# A quick check for comfort
print(campaign_data_model.head(3))

In [None]:
# lower case all words in tags for NLP preparation
campaign_data_model['tags'] = campaign_data_model['tags'].apply(lambda X: X.lower())

campaign_data_model.to_csv('dcs_campaign_data_tagged.csv', index=False)
# check-up (positional lookup, since rows were dropped and label 0 may no longer exist)
campaign_data_model['tags'].iloc[0]

## 3-1 Creating vectorizer and Vectorizing `'tags'` to 2D array <a name="3-1"></a>

In [None]:
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(campaign_data_model['tags']).toarray()

# verify shape and # of features
vectors.shape

## 3-2 Stemming `'tags'` to improve NLP accuracy <a name="3-2"></a>

In [None]:
ps = PorterStemmer()
def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return ' '.join(y)

In [None]:
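# apply to `tags`
campaign_data_model['tags'] = campaign_data_model['tags'].apply(stem)

# re-fit the vectorizer so that the vectors reflect the stemmed tags
vectors = cv.fit_transform(campaign_data_model['tags']).toarray()

# create similarity matrix using cosine similarity
similarity = cosine_similarity(vectors)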

In [None]:
# Manual check-up to see if model is behaving as intended, showing top 5 closest 'recommendations' with indexes
sorted(list(enumerate(similarity[0])), reverse=True, key=lambda x: x[1])[1:6]
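
For intuition about the scores in the similarity matrix, here is a small pure-Python sketch of cosine similarity over word counts. It is a simplification and not the scikit-learn implementation: it skips stop-word removal and the `max_features` limit, and the three tag strings are made up.

```python
import math
from collections import Counter


def cosine(a_tags, b_tags):
    """Cosine similarity between two space-separated tag strings."""
    a, b = Counter(a_tags.split()), Counter(b_tags.split())
    # Dot product over the words both strings share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


x = "train flight basic train"
y = "train flight combat"
z = "ww2 spitfire"

print(round(cosine(x, y), 4))  # shared words "train" and "flight"
print(cosine(x, z))            # no shared words
```

Two campaigns score close to 1 when their tags use the same words in similar proportions, and 0 when they share no words at all.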

## 3-3 Final Recommender function creation and testings <a name="3-3"></a> 

In [None]:
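def recommend(campaign):
    # positional row of the campaign; the dataframe index labels may have gaps after dropna()
    campaign_index = np.flatnonzero((campaign_data_model['name'] == campaign).to_numpy())[0]
    distances = similarity[campaign_index]
    # skip the first hit (the campaign itself) and keep the next 5
    campaign_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x:x[1])[1:6]
    
    for i in campaign_list:
        print(campaign_data_model.iloc[i[0]]['name'])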

In [None]:
# Testing
recommend("MAD JF-17 Thunder Campaign by Stone Sky")

## 4-1 Creating the `dcs_recommender` class and related functions <a name="4-1"></a>

In [4]:
from dcs_recommender import DCSRecommender

DCSRecommender?

Init signature: DCSRecommender(dataset)
Docstring:
This is a experimental non-commercial project by Donghang WU
    - The project intend to recommend relevant campaigns to players using
    the PorterStemmer NLP algorithm as a content-based processor
    
    - GUI and other front-end application will be added for UX and first stage implementation
    - Collaborative filtering may be added if data becomes available



The recommender takes in 1 dataset argument

**dataset must be prepared for nltk,
all features must been concatenated in 'tags' column form and lowercased

There are two functions currently available:

- choose_campaign: allow user to choose desired campaign from list 

- *recommend: recommend campaign using user choice or direct input 
  (input must be exact excluding "DCS:" from the title)

PACKAGE REQUIREMENTS:
    import pandas as pd
    import time
    from sklearn.feature_extr

## 4-2 Testing `dcs_recommender` and its methods <a name="4-2"></a>

In [1]:
import pandas as pd
import time
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
from dcs_recommender import DCSRecommender

campaign_data_model = pd.read_csv('dcs_campaign_data_tagged.csv')
rec = DCSRecommender(campaign_data_model)
campaign_data_model['tags'] = campaign_data_model['tags'].apply(rec.stem)

------ Welcome to the DCS Campaign Recommender! ------ 
 You can use "choose_campaign" method to choose an campaign from our list 
 Already knew what you own/want? 
 Try using "recommend" method and input exact campaign name as argument 
 Our mighty R2D2 will recommend some campaigns for you to purchase in the future!


In [11]:
rec.choose_campaign()

Is there a particular airframe era you are looking for?
Please enter "yes" or "no"!
yes
What is the airframe era (ww2, coldwar, modern, other)?
ohter
You have 3 chances left
Please enter a valid kind!
other
This is the list of other campaigns that has no module name
15       Ka-50 2 Pandemic Campaign by Armen Murazyan
6                          MAD Campaign by Stone Sky
47             The Border Campaign by Armen Murazyan
31    The Enemy Within 3.0 Campaign by Baltic Dragon
51            The Museum Relic Campaign by Apache600
Name: name, dtype: object
Please make your choice:
MAD Campaign by Stone Sky
You have chosen MAD Campaign by Stone Sky,
 that is a great choice!


In [12]:
rec.recommend()

Based on your choice, we also recommend the following campaign
MAD JF-17 Thunder Campaign by Stone Sky
F/A-18C Rise of the Persian Lion Campaign by Badger 633
Spitfire LF Mk. IX Operation Epsom Campaign by B&W Campaigns
P-51D Charnwood Campaign by B&W Campaigns
A-10C Operation Persian Freedom Campaign by Ground Pounder Sims


In [8]:
a = 'A-10C Basic Flight Training Campaign by Maple Flag Missions'
b = 'Su-27 The Ultimate Argument Campaign'
rec.recommend('MiG-21bis Battle of Krasnodar Campaign by SorelRo')

Based on your choice, we also recommend the following campaign
F-5E Black Sea Resolve '79 Campaign by SorelRo
P-51D Charnwood Campaign by B&W Campaigns
AV-8B Hormuz Freedom Campaign by SorelRo
The Enemy Within 3.0 Campaign by Baltic Dragon
Spitfire LF Mk. IX Operation Epsom Campaign by B&W Campaigns
