Introduction

Monitoring and responding to customer feedback is an essential first step, yet companies don't always take the time to analyze it. Indeed, who better than a person who has had a positive or negative experience with a product or service to give an opinion on it? Today, with the growth of the Internet, social networks and smartphones, user opinions add up to a very large body of information.

In the case of mobile applications, millions of users share their opinions and criticisms every day on Google Play and the Apple App Store, describing how they feel after using an application. At this volume, traditional marketing studies and manual reading no longer scale, so new techniques are needed to automate and optimize the analysis of user feedback.

User reviews give a better understanding of how customers consume and use the products offered by the company. They also highlight the positive and negative points of the customer journey. They are therefore valuable data that tell the company more about its current and future customers.

This project consists of 3 distinct parts :

Web Scraping
Sentiment Analysis
Topic Modeling

Web Scraping :

The first step of the project was to extract user data from the Google Play Store using web scraping. The objective is to extract the content of a web page in a structured way. The main benefit of web scraping is that it can harvest content that cannot simply be copied and pasted without losing the structure of the document. For this project, I wrote a Python script that scrapes the user data and stores it in a structured form, namely a CSV file.

The web scraping script was written using the BeautifulSoup and Selenium modules. The extracted data are stored in the "scraped_reviews" folder. For example, I scraped 10,000 reviews from Android applications such as Instagram, Facebook and Netflix.
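The parsing step can be sketched as follows. This is a minimal illustration and not the project's actual script: the rendered HTML is assumed to be already available as a string (for instance from a Selenium-driven browser), the markup and class names in SAMPLE_HTML are hypothetical placeholders rather than Google Play's real ones, and the standard library's html.parser is used in place of BeautifulSoup so that the example runs without extra dependencies.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real Google Play page uses different class names,
# so every tag and class below is a placeholder for illustration.
SAMPLE_HTML = """
<div class="review">
  <span class="user">Alice</span>
  <span class="stars" data-rating="5"></span>
  <p class="text">Great app, works well.</p>
</div>
<div class="review">
  <span class="user">Bob</span>
  <span class="stars" data-rating="2"></span>
  <p class="text">Crashes on startup.</p>
</div>
"""


class ReviewParser(HTMLParser):
    """Collect one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._current = None  # review currently being built
        self._field = None    # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class")
        if tag == "div" and css_class == "review":
            self._current = {"user_name": "", "num_stars": None, "review": ""}
        elif self._current is not None:
            if css_class == "user":
                self._field = "user_name"
            elif css_class == "text":
                self._field = "review"
            elif css_class == "stars":
                self._current["num_stars"] = int(attrs["data-rating"])

    def handle_data(self, data):
        # Only keep text that sits inside a field we are interested in.
        if self._current is not None and self._field is not None:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self.reviews.append(self._current)
            self._current = None
        self._field = None


def parse_reviews(html):
    """Return the list of review dicts found in an HTML string."""
    parser = ReviewParser()
    parser.feed(html)
    return parser.reviews


reviews = parse_reviews(SAMPLE_HTML)
for review in reviews:
    print(review)
```

With BeautifulSoup the same idea is usually shorter, since elements can be selected by class directly instead of tracking parser state by hand.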

There are 8 columns in the CSV file :

user_name : Username of the Google account
date : Date of the review
num_stars : Number of stars of the review (1 to 5)
review : Textual review of the user
num_likes : Number of likes the review received from other users
user_name_answer : Username of the account that answered the review
date_answer : Date of the answer
answer : Textual content of the answer
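As a rough sketch of how rows with these columns can be written and read back, the example below uses Python's csv module. The two rows are made up, the date format is an assumption (the format used in the scraped files is not stated here), and an in-memory buffer stands in for a file in the "scraped_reviews" folder.

```python
import csv
import io

# Column names as listed above.
COLUMNS = [
    "user_name", "date", "num_stars", "review",
    "num_likes", "user_name_answer", "date_answer", "answer",
]

# Made-up rows for illustration; the date format is an assumption.
rows = [
    {"user_name": "Alice", "date": "2020-01-15", "num_stars": 5,
     "review": "Great app, works well.", "num_likes": 12,
     "user_name_answer": "", "date_answer": "", "answer": ""},
    {"user_name": "Bob", "date": "2020-01-16", "num_stars": 2,
     "review": "Crashes on startup.", "num_likes": 3,
     "user_name_answer": "Example Dev", "date_answer": "2020-01-17",
     "answer": "Sorry about that, a fix is on the way."},
]


def write_reviews(rows, handle):
    """Write review dicts to an open text handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_reviews(handle):
    """Read CSV rows back as a list of dicts (all values are strings)."""
    return list(csv.DictReader(handle))


buffer = io.StringIO()  # stands in for a file under scraped_reviews/
write_reviews(rows, buffer)
buffer.seek(0)
loaded = read_reviews(buffer)

# Values come back as strings, so convert before computing statistics.
average_stars = sum(int(r["num_stars"]) for r in loaded) / len(loaded)
print(average_stars)
```

Reviews without a developer answer simply leave the three answer columns empty, which keeps every row the same width.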

Topic Modeling with Latent Dirichlet Allocation :

Topic modeling is a text mining technique that uses statistical machine learning, usually unsupervised, to identify themes in a corpus or a large amount of unstructured text. From a collection of documents, the model groups words that tend to occur together into clusters, and each cluster is interpreted as a topic.

Latent Dirichlet Allocation (LDA) is a popular method for fitting a topic model. It treats each document as a mixture of topics and each topic as a mixture of words. This allows documents to "overlap" in terms of content, rather than being separated into distinct groups, in a way that reflects typical natural language usage.
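The "mixture" idea can be illustrated with a few lines of arithmetic. The sketch below does not fit an LDA model; it only shows how, once topic and document distributions are known, the probability of a word in a document is the topic-weighted sum of its probability in each topic. All numbers and the two topic names are invented for illustration (a fitted LDA model produces unlabeled topics).

```python
# Toy numbers chosen by hand, for illustration only.
# Each topic is a probability distribution over the vocabulary.
topic_word = {
    "performance": {"crash": 0.5, "slow": 0.3, "login": 0.1, "ads": 0.1},
    "monetization": {"crash": 0.1, "slow": 0.1, "login": 0.1, "ads": 0.7},
}

# Each document is a probability distribution over topics.
doc_topic = {"performance": 0.75, "monetization": 0.25}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word[topic][word]
        for topic, weight in doc_topic.items()
    )


vocabulary = list(topic_word["performance"])
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}

for word, prob in doc_word.items():
    print(f"{word}: {prob:.3f}")
```

Because the document draws on both topics, a word such as "ads" keeps a noticeable probability even though the document is mostly about the first topic. This is the overlap described above.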

2-dimensional visualization of the topic model :

We can produce a two-dimensional visualization of the topics extracted from the scraped dataset. To do so, we use pyLDAvis, a Python library for interactive LDA visualization. The figure below is a screenshot of the 2D visualization of the LDA model I implemented:

The size of each circle represents the importance of the topic across the whole corpus, and the distance between the centers of the circles indicates how similar the topics are. For each topic, the bar chart on the right side lists the 30 most relevant terms. With LDA, I extracted 7 main topics.
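To give an idea of what "most relevant" means, the sketch below ranks terms with the relevance score that, to my understanding, pyLDAvis uses: a weighted sum of the log probability of a term within a topic and the log of its lift (its probability in the topic divided by its probability in the whole corpus). The distributions are toy values, and the function names and the default weight `lam` are assumptions made for this example, not part of the pyLDAvis API.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))


def top_terms(topic_dist, corpus_dist, lam=0.6, n=30):
    """Return up to n terms of a topic, ordered by decreasing relevance."""
    scores = {
        word: relevance(p, corpus_dist[word], lam)
        for word, p in topic_dist.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy distributions, invented for illustration.
topic_dist = {"app": 0.45, "crash": 0.35, "update": 0.20}   # p(word | topic)
corpus_dist = {"app": 0.60, "crash": 0.10, "update": 0.30}  # p(word)

by_probability = top_terms(topic_dist, corpus_dist, lam=1.0)
by_lift = top_terms(topic_dist, corpus_dist, lam=0.0)
print(by_probability)
print(by_lift)
```

With `lam=1.0` the ranking follows raw frequency within the topic, so a word that is common everywhere (here "app") comes first. With `lam=0.0` the ranking favors words that are distinctive for the topic (here "crash").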

AmineAgrane/Web-Scraping-and-Topics-Modeling-Android-AppStore
