# Sentiment Analysis

### Table of Contents

* [What is Data Science?](#I--What-is-Data-Science?)
    * [Statistics Descriptive](#1--Statistics-Descriptive)
    * [Statistics Inferential](#2--Statistics-Inferential)
    * [Data Mining](#3--Data-Mining)
    * [Machine Learning (overview)][ML]
* [Machine Learning](#II--Machine-Learning)
    * [Supervised](#1--Supervised)
    * [Unsupervised](#2--Unsupervised)
    * [Natural Language Processing (NLP)][NLP]
* [Sentiment Analysis](#III--Sentiment-Analysis)
    * [Principal Data Cleaning Process (NLTK)][NLTK]
    * [Algorithms Implemented](#2--Algorithms-Implemented)
    * [Libraries Used For NLP](#3--Libraries-Used-For-NLP)
* [Project Implementation](#IV--Project-Implementation)
* [Conclusion](#V--Conclusion)
* [Links](#Links)


[ML]: #4--Machine-Learning-(overview)
[NLP]: #3--Natural-Language-Processing-(NLP)
[NLTK]: #1--Principal-Data-Cleaning-Process-(NLTK)

## I- What is Data Science?
<br></br>
![DS.png](attachment:DS.png)
<br>
First of all, **data science** is a method of providing **actionable intelligence from data using math, statistics, programming, and business expertise.** Like any scientific method, it involves gathering data, identifying a problem, forming a hypothesis, and running tests. More specifically, data scientists follow a process of gathering and cleaning data _(wrangling)_, investigation _(exploratory data analysis)_, building automation using machine learning _(feature engineering, model development, and deployment)_, delivering results _(visualizations, reporting, storytelling)_, and maintenance. Practitioners typically spend roughly **70-80%** of their time on wrangling and exploration, about **20%** on machine learning models, and the rest on maintenance. Most importantly, this whole process should result in a valuable action or insight for the end user, i.e. a business or customer!

### 1- Statistics Descriptive
<br></br>
* **Descriptive statistics** are used to **describe the basic features of the data** in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.
* **Descriptive statistics** are used to present **quantitative descriptions in a manageable form**. In a research study we may have many measures, or we may measure a large number of people on any one measure. Descriptive statistics help us simplify large amounts of data in a sensible way.
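
As a minimal sketch of the summaries described above, the snippet below uses Python's standard `statistics` module on a small sample. The `ages` list is a made-up example, not data from this project.

```python
import statistics

# Hypothetical sample: ages of ten survey respondents (illustrative only).
ages = [23, 25, 31, 35, 35, 40, 42, 47, 51, 60]

# Collect a few common descriptive statistics in one place.
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Each entry reduces the whole sample to a single number, which is the kind of simplification descriptive statistics aim for.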

### 2- Statistics Inferential
<br></br>
* With **inferential statistics**, we are trying to **reach conclusions that extend beyond the immediate data alone.** For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to judge the probability that an observed difference between groups is a dependable one rather than one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions.
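
One way to judge whether a difference between two groups could have happened by chance is a permutation test. The sketch below is only an illustration of that idea: the function name `permutation_test`, the two score lists, and the number of resamples are all assumptions, not part of this project.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when the group labels are shuffled at random."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical test scores for two groups (illustrative numbers only).
group_a = [72, 75, 78, 80, 83, 85, 88]
group_b = [60, 62, 65, 68, 70, 71, 74]

observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference in means: {observed:.2f}")
print(f"estimated p-value: {p_value:.4f}")
```

A small estimated p-value suggests the observed difference would rarely arise from random labeling alone, which is the kind of judgment inferential statistics supports.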

### 3- Data Mining
![data_mining.png](attachment:data_mining.png)

**Data mining** is the **process of finding anomalies, patterns and correlations within large data sets to predict outcomes**. Using a broad range of techniques, you can use this information to increase revenues, cut costs, improve customer relationships, reduce risks and more.
In short, **data mining is about finding the trends in a data set and using these trends to identify future patterns.**
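
To make "correlations" and "anomalies" concrete, here is a small sketch that computes a Pearson correlation and flags outliers with a z-score rule. The helper names, the threshold of 2.0, and all the numbers are hypothetical, and real data mining tools use far richer techniques.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    away from the mean (population standard deviation)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical business data (illustrative only).
ad_spend = [10, 12, 15, 18, 20, 25, 30]
revenue = [105, 118, 150, 178, 205, 248, 310]
daily_orders = [50, 52, 49, 51, 48, 53, 50, 120]

print("correlation:", round(pearson(ad_spend, revenue), 3))
print("anomalies:", z_score_outliers(daily_orders))
```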

### Data Mining Vs Data Science
<br></br>
<ul>
<li>Data Mining is an activity which is a part of a broader Knowledge Discovery in Databases (KDD) Process while Data Science is a field of study just like Applied Mathematics or Computer Science.</li>
<li>Often Data Science is looked upon in a broad sense while Data Mining is considered a niche.</li>
<li>Some activities under Data Mining such as statistical analysis, writing data flows and pattern recognition can intersect with Data Science. Hence, Data Mining becomes a subset of Data Science.</li>
<li>Machine Learning in Data Mining is used more in pattern recognition while in Data Science it has a more general use.</li>
</ul>
<br></br>
<table style="height: 684px" width="792">
<tbody>
<tr>
<td width="105"><strong>Basis for comparison</strong></td>
<td style="text-align: center" width="245"><strong>Data Mining</strong></td>
<td style="text-align: center" width="274"><strong>Data Science</strong></td>
</tr>
<tr>
<td width="105"><strong>What is it?</strong></td>
<td width="245">A technique</td>
<td width="274">An area</td>
</tr>
<tr>
<td width="105"><strong>Focus</strong></td>
<td width="245">Business process</td>
<td width="274">Scientific study</td>
</tr>
<tr>
<td width="105"><strong>Goal</strong></td>
<td width="245">Make data more usable</td>
<td width="274">Building Data-centric products for an organization</td>
</tr>
<tr>
<td width="105"><strong>Output</strong></td>
<td width="245">Patterns</td>
<td width="274">Varied</td>
</tr>
<tr>
<td width="105"><strong>Purpose</strong></td>
<td width="245">Finding trends previously not known</td>
<td width="274">Social analysis, building predictive models, unearthing unknown facts, and more</td>
</tr>
<tr>
<td width="105"><strong>Vocational Perspective</strong></td>
<td width="245">Someone with a knowledge of navigating across data and statistical understanding can conduct data mining</td>
<td width="274">A person needs to understand Machine Learning, Programming, info-graphic techniques and have the domain knowledge to become a data scientist</td>
</tr>
<tr>
<td width="105"><strong>Extent</strong></td>
<td width="245">Data mining can be a subset of Data Science as Mining activities are part of the Data Science pipeline</td>
<td width="274">Multidisciplinary –&nbsp; Data Science consists of Data Visualizations, Computational Social Sciences, Statistics, Data Mining, Natural Language Processing, et cetera</td>
</tr>
<tr>
<td width="105"><strong>Deals with (the type of data)</strong></td>
<td width="245">Mostly structured</td>
<td width="274">All forms of data – structured, semi-structured and unstructured</td>
</tr>
<tr>
<td width="105"><strong>Other less popular names</strong></td>
<td width="245">Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction</td>
<td width="274">Data-driven Science</td>
</tr>
</tbody>
</table>

### 4- Machine Learning (overview)
<br></br>
![AI.png](attachment:AI.png)
<br></br>
* **Machine learning (ML)** is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
* **Machine learning** is closely related to computational statistics, which focuses on making predictions using computers.
* The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine correct answers.

## II- Machine Learning
<br></br>
![machine-learning.png](attachment:machine-learning.png)

```python
from IPython.display import IFrame

# Embed the public MindMeister mind map of machine learning algorithms.
IFrame('https://www.mindmeister.com/maps/public_map_shell/1626104926/machine-learning-algorithms?width=850&height=450&z=auto&presentation=1', width=850, height=450)
```

### 1- Supervised
<br></br>
**Supervised learning** is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object _(typically a vector)_ and a desired output value _(also called the supervisory signal)_.
<br></br>
![supervised.png](attachment:supervised.png)

#### a- Regression

##### Linear regression:
**Linear regression** is a **linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables)**. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.
<br></br>
![regressioncurv.png](attachment:regressioncurv.png)
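
For the simple (one-variable) case, the least-squares line has a closed-form solution. The sketch below implements it in plain Python. The function name and the `hours`/`scores` data are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied vs. exam score (illustrative only).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 63, 70]

slope, intercept = fit_simple_linear_regression(hours, scores)
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
print("prediction for 6 hours:", intercept + slope * 6)
```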

##### Polynomial regression:
**Polynomial regression** is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an **nth degree polynomial in x**. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
<br></br>
![polynomial-reg.png](attachment:polynomial-reg.png)
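
The point that polynomial regression is linear in its parameters can be seen in code: the powers of x become ordinary columns of a design matrix, and the coefficients come from a linear system. This is a dependency-free sketch under that assumption; the helper names and the sample data are hypothetical, and in practice a numerical library would usually be used instead.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree] of
    y ~ b0 + b1*x + ... + b_degree * x**degree."""
    # Each row of the design matrix holds 1, x, x^2, ... for one sample.
    design = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    # Normal equations (X^T X) b = X^T y: linear in the unknown coefficients.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated from y = 1 + 2x + 3x^2 (illustrative only).
xs = [-2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

coeffs = fit_polynomial(xs, ys, degree=2)
print([round(b, 6) for b in coeffs])
```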

#### b- Classification
<br></br>
**Classification** is a **process of categorizing a given set of data into classes**. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as targets, labels or categories.

Classification predictive modeling is the task of approximating the mapping function from input variables to discrete output variables. The main goal is to identify **which class/category** the new data will fall into.
![classification.png](attachment:classification.png)
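
As one illustration of mapping inputs to discrete labels, the sketch below classifies a new point by a majority vote among its nearest labeled neighbors (k-nearest neighbors, which is also listed under the algorithms implemented). The function name, the two-feature points, and the labels are hypothetical and do not come from the project's dataset.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among the k
    training points closest to it (Euclidean distance)."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples with two numeric features each.
train_points = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.1),
                (5.0, 5.2), (4.8, 5.1), (5.3, 4.9)]
train_labels = ["negative"] * 3 + ["positive"] * 3

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```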
<br></br>
<br></br>
##### Logistic regression:

<br></br>
##### Decision Tree:


### 2- Unsupervised
<br></br>
**Unsupervised learning** is a type of machine learning that looks for previously **undetected patterns in a data set with no pre-existing labels**. When a dataset is provided without labels, the model learns useful properties of the structure of the dataset and comes up with patterns or conclusions from the unlabeled data.
<br></br>
![unsupervised.png](attachment:unsupervised.png)

#### a- Clustering

##### k-means:
**k-means clustering** is a method of **vector quantization**, originally from signal processing, that aims to **partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean** (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. It is popular for cluster analysis in data mining. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
<br></br>
![k-means.gif](attachment:k-means.gif)
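
The assign-then-update loop behind k-means (often called Lloyd's algorithm) can be sketched in a few lines. The version below takes the starting centroids as an argument so the result is deterministic. The function name, the points, and the choice of initial centroids are assumptions made for illustration.

```python
import math

def k_means(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and moving each centroid to the mean of its points."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments

# Hypothetical 2-D points forming two visible groups (illustrative only).
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignments = k_means(points, initial_centroids=[points[0], points[3]])

print("assignments:", assignments)
print("centroids:", centroids)
```

Because each centroid is the mean of its cluster, this loop reduces the within-cluster sum of squared distances, matching the description above.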


##### hierarchical:
**Hierarchical clustering**, also known as hierarchical cluster analysis, is an algorithm that **groups similar objects into groups called clusters**. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.
<br></br>
![hierarch.gif](attachment:hierarch.gif)
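
A common bottom-up (agglomerative) variant starts with every object in its own cluster and repeatedly merges the two closest clusters. The sketch below assumes single linkage (distance between the closest pair of members) and stops at a requested number of clusters. The text above does not prescribe either choice, and the function name and points are hypothetical.

```python
import math

def agglomerative_clustering(points, n_clusters):
    """Bottom-up clustering with single linkage: merge the two closest
    clusters until only `n_clusters` remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-D points (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters = agglomerative_clustering(points, n_clusters=3)

for cluster in clusters:
    print(cluster)
```

Recording the order of merges instead of stopping early would give the full hierarchy that is usually drawn as a dendrogram.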


### 3- Natural Language Processing (NLP)

## III- Sentiment Analysis

### 1- Principal Data Cleaning Process (NLTK)

### 2- Algorithms Implemented

##### K-Nearest Neighbors:

##### Naive Bayes:

![ML1.png](attachment:ML1.png)

![ML2.png](attachment:ML2.png)

![ML3.png](attachment:ML3.png)

![ML4.png](attachment:ML4.png)

![ML5.png](attachment:ML5.png)

![ML6.png](attachment:ML6.png)

### 3- Libraries Used For NLP

## IV- Project Implementation

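```python
from IPython.display import HTML

# Inject a "Toggle Code" button that hides or shows the notebook's input cells.
# Assumes jQuery's `$` is available, as in the classic Jupyter Notebook interface.
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show;
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Toggle Code"></form>''')
```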

## V- Conclusion 

Data sources: [kaggle](https://www.kaggle.com/arkhoshghalb/twitter-sentiment-analysis-hatred-speech), [wikipedia](https://en.wikipedia.org/), [github](https://github.com/), [codeup](https://codeup.com/), [medium](https://medium.com/), [besanttechnologies](https://www.besanttechnologies.com/), [conjointly](https://conjointly.com/), [sas](https://www.sas.com/), [educba](https://www.educba.com/), [edureka](https://www.edureka.co/)

## Links

[![](https://img.shields.io/badge/My-Portfolio-brightgreen)](https://salah-zkara.codes/)

[![](https://img.shields.io/badge/-Facebook-%234267B2)](https://www.facebook.com/salaheddine.zkara.9)

[![](https://img.shields.io/badge/-Linkedin-%232867B2)](https://www.linkedin.com/in/salah-eddine-zkara-b40b091a6/)

[![](https://img.shields.io/badge/-Twitter-%231DA1F2)](https://twitter.com/SalahZkara)

[![](https://img.shields.io/badge/-Github-333)](https://github.com/Salah-Zkara)

[![](https://img.shields.io/badge/-Instagram-%23E1306C)](https://www.instagram.com/salaheddine.zkara/?hl=en)
![GitHub followers](https://img.shields.io/github/followers/Salah-Zkara?style=social)


#### Supervisor: Guezaz Azidine