# Data Science Programming Languages - DSAI 1303
## Course Project: Sentiment Analysis of Twitter Data

Twitter has emerged as a fundamentally new instrument to obtain social measurements. For example, researchers have shown that the "mood" of communication on Twitter can be used to predict the stock market.

In this programming project you will:

* Load and prepare a collected set of Twitter data for analysis
* Estimate the sentiment associated with individual tweets
* Estimate the sentiment of a particular term

Please keep in mind the following points:
* This assignment is open-ended in several ways. You will need to make some decisions about how to best solve each of the problems mentioned above.
* **It is absolutely fine to discuss your solutions with your classmates but you are not allowed to share code.**
* **Each student must submit their own solution via Google Classroom.**

## Formatting of Twitter Data

Strings in the Twitter data prefixed with the letter "u" are Unicode strings. For example: `u"This is a string"`.

Unicode is a standard for representing a much larger variety of characters beyond the Roman alphabet (Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems, etc.).

In Python 3, every string (`str`) is a Unicode string, so the `u` prefix is optional and you can use these values just like any other string.

If you encounter an error when printing Unicode, you can use the [encode](https://docs.python.org/3/library/stdtypes.html#str.encode) method to properly print the international characters. You can find more information about Unicode and Python 3 [here](https://docs.python.org/3/howto/unicode.html).
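
As a small illustration of the `encode` method mentioned above, the sketch below converts a string with non-ASCII characters to bytes and back. The sample text is made up for illustration and does not come from the tweet data.

```python
# A made-up string that mixes accented Latin and Greek characters.
text = "café Ω"

# str.encode turns the Unicode string into bytes in the chosen encoding.
encoded = text.encode("utf-8")

# bytes.decode reverses the conversion.
decoded = encoded.decode("utf-8")

# When the target encoding cannot represent a character, errors="replace"
# substitutes "?" instead of raising UnicodeEncodeError.
ascii_safe = text.encode("ascii", errors="replace")

print(encoded)
print(decoded == text)
print(ascii_safe)
```

Encoding with `errors="replace"` loses information, so it is best kept for display purposes only.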

# Question 1: Loading and Cleaning Twitter Data [20 points]

In this first part, you will need to load a sample of tweets into memory and prepare them for analysis. The tweets are stored in the file `tweets.json`. This file follows the *JSON* format. JSON stands for JavaScript Object Notation. It is a simple format for representing nested structures of data: lists of lists of dictionaries of lists of ... you get the idea.

Each line of `tweets.json` represents a message. It is straightforward to convert a JSON string into a Python data structure; Python's standard library includes the `json` module for this purpose. Below we will show you how to load the data and how to parse the first line in the `tweets.json` file.
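
A minimal sketch of this loading step is given here. It assumes that `tweets.json` holds one JSON object per line, as described above; the helper names and the sample line are hypothetical and only stand in for the real data.

```python
import json

def parse_tweet_line(line):
    """Convert one line of JSON text into a Python dictionary."""
    return json.loads(line)

def load_first_tweet(path):
    """Read and parse only the first line of a file with one tweet per line."""
    with open(path, encoding="utf-8") as f:
        return parse_tweet_line(f.readline())

# Made-up line standing in for the first line of tweets.json.
sample_line = (
    '{"created_at": "Mon Jan 01 12:00:00 +0000 2018", '
    '"text": "Hello, @some_user check http://example.com", '
    '"entities": {"hashtags": []}}'
)

tweet = parse_tweet_line(sample_line)
print(type(tweet))
print(sorted(tweet.keys()))
```

With the real file in the working directory, `load_first_tweet("tweets.json")` would return the dictionary for the first tweet.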

Each entry in `tweets.json`, i.e., each `tweet`, corresponds to a dictionary that contains lots of information about the tweet, the user, the activity related to the tweet (i.e., if it was retweeted or not), the timestamp of the tweet, entities mentioned in the tweet, hashtags used, etc.

You can treat the `tweet` variable from above as a dictionary and use the `.keys()` method to see the fields associated with it.

We can select any of the aforementioned values of the variable `tweet` by treating it as a dictionary. For example, let's select the `text` body of the tweet, the time it was `created_at`, and the `hashtags` it contains.
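
The selection described above can be sketched as follows. The tweet is a made-up example, and the sketch assumes that hashtags are stored under `entities` as a list, which may differ from the layout of your file.

```python
import json

# Made-up tweet; the field layout is an assumption for illustration.
tweet = json.loads(
    '{"created_at": "Mon Jan 01 12:00:00 +0000 2018", '
    '"text": "Hello, @some_user check http://example.com", '
    '"entities": {"hashtags": []}}'
)

# Dictionary indexing selects individual fields.
text = tweet["text"]
created_at = tweet["created_at"]

# .get avoids a KeyError when a nested field is missing.
hashtags = tweet.get("entities", {}).get("hashtags", [])

print("Tweet body:", text)
print("Created at:", created_at)
print("Hashtags:", hashtags)
```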

As you can see, this tweet contains no hashtags. The body of the tweet contains several pieces of information that are not necessary for our sentiment analysis task. For example, it contains a comma, a reference to a Twitter user, and a link to an external website.

Since this information is not necessary we can remove it. In other words we need to clean our input in order to prepare it for analysis. Next, we show you some basic cleaning operations using **regular expressions**. You can find more information on regular expressions [here](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285).

In [None]:
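# Basic steps for the cleaning process.
import re
import pandas as pd

# Load the collected tweets; the code below uses the "Text" and "Topic" columns.
Data = pd.read_excel(r"C:\Users\pc\Downloads\مشروع البرمجة\DATA.xlsx")

def Twitter_filter(x):
    # Replace user mentions such as @name with the placeholder "User".
    Del_User = re.sub(r'(@\w\w+)', 'User', x)
    # Replace a link, together with the rest of its line, with the placeholder "Link".
    Del_URL = re.sub(r'((https?)://(www)?\.?(\w+)\.(\w+):?(\d+)?/?(.+))', "Link", Del_User)
    return Del_URL

# Clean the "Text" column and keep the "Topic" column next to it.
F_Data = Data.Text.apply(Twitter_filter)
data_filter = Data.filter(["Topic"])
Data = pd.concat([F_Data, data_filter], axis=1)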

def remove_emoji(string):
    # Character ranges that cover emoji and related symbols.
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u200d"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\u3030"
                               u"\ufe0f"
                               "]+", flags=re.UNICODE)
    # Remove the matched characters and lowercase the result.
    return emoji_pattern.sub('', string).lower()
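Body = Data['Text'].apply(remove_emoji)
print('Clean tweet body:')
print(Body)
# Write one clean tweet per line, in the original order.
with open("clean_tweets.txt", 'w', encoding='utf-8') as file:
    for tweet in Body:
        file.write(tweet.replace('\n', ' ') + '\n')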

Clean tweet body:
0      user hello rose ! love the colors and her flo...
1      special delivery of flowers for user !!   hi ...
2      i was in the group on the left with the blue ...
3      if youre going to walk to the coffee shop hav...
4                           he cant be this funny more 
5      this is so funny have to be petty to crop the...
6                                  madam what is  funny
7      its funny to see how they stop talking about ...
8      user an a funny microphone  like regularly sc...
9      it is so fucking funny i was watching it infr...
10     my lil pussc be saying wtf she said no funny ...
11     user omg hey we never really talk but i think...
12     isnt it funny how day by day nothing changes ...
13     user it would be a pretty funny way to integr...
14            user hope he never sees one if his films 
15     just gooosebumps merely thinking of all the d...
16          user user excellent music , beautiful city 
17     i really want to clear 

We are providing you with a Python script named `preprocess.py`. The script `preprocess.py` accepts one argument on the command line: a JSON file with tweets (i.e., `tweets.json`). You can run the program like this:

`$ python3 preprocess.py tweets.json`

**There are some parts specified in this script that you need to implement**. The goal of this script is to clean all the tweets in `tweets.json`. Running `preprocess.py` will generate an output file named `clean_tweets.txt` containing **one string per line**, each of which is a clean tweet. The order of the clean tweets in your output file should follow the order of the lines in the original `tweets.json`. That is, the first line in `clean_tweets.txt` should correspond to the first raw tweet in `tweets.json`, the second line should correspond to the second tweet, and so on. If you perform any sorting or put the processed data in a dictionary, the order may not be preserved. Once again: **The n-th line of `clean_tweets.txt` (the file you will submit) should be a string that represents the clean version of the n-th line in `tweets.json` (the input file).**

You must provide a line for **every** tweet. If the clean tweet is the empty string then just provide a line with the empty string.
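
The ordering and empty-line requirements above can be sketched as follows. The helper name `write_clean_tweets` and the sample tweets are hypothetical; the provided `preprocess.py` may organize this step differently.

```python
import os
import tempfile

def write_clean_tweets(clean_tweets, path):
    """Write one clean tweet per line, keeping the order of the input list."""
    with open(path, "w", encoding="utf-8") as out:
        for tweet in clean_tweets:
            # Replace line breaks inside a tweet so it stays on a single line.
            one_line = tweet.replace("\r", " ").replace("\n", " ")
            # An empty tweet still produces an (empty) line.
            out.write(one_line + "\n")

# Made-up clean tweets; the second one is empty after cleaning.
clean = ["hello world", "", "second tweet"]

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "clean_tweets.txt")
    write_clean_tweets(clean, target)
    with open(target, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

print(lines)
```

Iterating over a list (rather than a dictionary or a sorted collection) is what keeps the n-th output line aligned with the n-th input tweet.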

***What to turn in: The file `clean_tweets.txt` output by `preprocess.py` after you have implemented the missing parts in `preprocess.py`.***

# Question 2: Derive the sentiment of each tweet [40 points]

For this part, you will compute the sentiment of each clean tweet in `clean_tweets.txt` based on the sentiment scores of the terms in the tweet. The sentiment of a tweet is equivalent to the sum of the sentiment scores for each term in the clean tweet.
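
As a minimal sketch of this sum, the function below adds up the scores of the whitespace-separated terms of a tweet. The dictionary `toy_scores` is made up for illustration and is not read from `AFINN-111.txt`; multi-word phrases are not handled here.

```python
def tweet_sentiment(tweet, scores):
    """Sum the scores of the terms in a tweet; unknown terms count as 0."""
    return sum(scores.get(term, 0) for term in tweet.split())

# Illustrative scores only, not loaded from the sentiment file.
toy_scores = {"great": 3, "fun": 4, "bad": -3}

example_score = tweet_sentiment("great fun but bad weather", toy_scores)
print(example_score)  # 3 + 4 + 0 - 3 + 0 = 4
```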

You are provided with a skeleton file `tweet_sentiment.py` which accepts two arguments on the command line: a *sentiment file* and a tweet file like the one you generated in Question 1. You can run the skeleton program like this:

`$ python3 tweet_sentiment.py AFINN-111.txt clean_tweets.txt`

The file `AFINN-111.txt` contains a list of pre-computed sentiment scores. Each line in the file contains a word or phrase followed by a sentiment score. Each word or phrase that is found in a tweet but not found in `AFINN-111.txt` should be given a sentiment score of 0. See the file `AFINN-README.txt` for more information.

To use the data in the `AFINN-111.txt` file, you may find it useful to build a dictionary. Note that the `AFINN-111.txt` file format is tab-delimited, meaning that the term and the score are separated by a tab character. A tab character corresponds to the string "\t". The following snippet of code may be useful:

In [None]:
import pandas as pd

# Build a dictionary that maps each AFINN term to its integer score.
scores = {}  # initialize an empty dictionary
with open('AFINN-111.txt', 'r') as afinnfile:
    for line in afinnfile:
        term, score = line.split("\t")  # The file is tab-delimited and "\t" means tab character
        scores[term] = int(score)  # Convert the score to an integer. It was parsed as a string.
#print(scores.items())  # Print every (term, score) pair in the dictionary

# Common words that are skipped when collecting terms without a score.
not_important = {'to', 'is', 'or', 'since', 'at', 'under', 'my', 'we', 'i', 'up', 'you', 'it',
                 'on', 'in', 'from', 'about', 'with', 'for', 'of', 'by', 'what', 'when', 'where',
                 'which', 'who', 'whom', 'why', 'how', 'how many'}

# Split every clean tweet into words and strip trailing punctuation.
Tweet_list = []
for tweet in Body:
    Tweet_list.append([word.rstrip('.,?!:') for word in tweet.split(' ')])

val_exist = {}      # tweet index -> sum of the scores of its known words
val_not_exist = {}  # tweet index -> last word in the tweet that has no score
for tweets in range(len(Tweet_list)):
    Sum_Word = 0
    for Word in Tweet_list[tweets]:
        if Word in scores:
            Sum_Word = Sum_Word + scores[Word]
        elif Word != '' and Word not in not_important:
            val_not_exist[tweets] = Word
    # Record a score for every tweet, so tweets without known words get 0.
    val_exist[tweets] = Sum_Word

Twite_Sum = pd.Series(val_exist)
# Write one numeric score per line, in the order of the tweets.
with open("sentiment.txt", 'w') as file:
    for score in Twite_Sum:
        file.write(str(score) + '\n')

Your script should output a file named `sentiment.txt` containing the sentiment of each tweet in the file `clean_tweets.txt`, one numeric sentiment score per line. The first score should correspond to the first tweet, the second score should correspond to the second tweet, and so on. In other words, **the n-th line of the file you submit should contain only a single number that represents the score of the n-th tweet in the input file.**

After you have implemented everything, the first 10 lines of the output generated by your script should exactly match the following lines:

```
0
0
0
0
0
1
2
-4
0
0
```

***What to turn in: The file `sentiment.txt` after you have verified that it returns the correct answers***

# Question 3: Derive the sentiment of new terms [40 points]

In this part you will create a script that computes the sentiment for terms that **do not** appear in the file `AFINN-111.txt`.

You can think about this problem as follows: We know we can use the sentiment-carrying words in `AFINN-111.txt` to deduce the overall sentiment of a tweet. Once you deduce the sentiment of a tweet, you can work backwards to deduce the sentiment of the non-sentiment-carrying words that *do not appear* in `AFINN-111.txt`. For example, if the word *football* always appears in proximity with positive words like *great* and *fun*, then we can deduce that the term *football* itself carries a positive sentiment.

You are provided with a skeleton file `term_sentiment.py` which accepts the same two arguments as `tweet_sentiment.py` and can be executed using the following command:

`$ python3 term_sentiment.py AFINN-111.txt clean_tweets.txt`

Your script should print its output to stdout. Each line of the output should contain a term, followed by a space, followed by a sentiment. That is, each line should be in the format `<term:string> <sentiment:float>`. For example, if you have the pair ("foo", 54.2) in Python, it should appear in the output as: `foo 54.2`.

*The order of your output does not matter.*
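
One possible way to work backwards from tweet sentiment to term sentiment is sketched here: each term that has no score receives the average sentiment of the tweets it appears in. This averaging rule is only one choice among many, since the assignment is open-ended; the function name, scores, and tweets below are all made up for illustration.

```python
from collections import defaultdict

def derive_term_sentiment(clean_tweets, scores):
    """Give each unscored term the mean sentiment of the tweets containing it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tweet in clean_tweets:
        terms = tweet.split()
        tweet_score = sum(scores.get(term, 0) for term in terms)
        # Count each term once per tweet, even if it is repeated.
        for term in set(terms):
            if term not in scores:
                totals[term] += tweet_score
                counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}

# Illustrative data only, not taken from AFINN-111.txt or clean_tweets.txt.
toy_scores = {"great": 3, "fun": 4, "bad": -3}
toy_tweets = ["football is great fun", "football was great", "bad traffic"]

new_terms = derive_term_sentiment(toy_tweets, toy_scores)
for term, sentiment in new_terms.items():
    # Output format: <term:string> <sentiment:float>
    print(f"{term} {sentiment}")
```

In this toy data, *football* appears in tweets scoring 7 and 3, so it receives 5.0, while *traffic* receives -3.0.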

***What to turn in: The file `term_sentiment.py` after you have implemented the missing parts.***


In [None]:
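# Assign a score to the collected terms that are missing from AFINN-111.txt.
# Simple heuristic: the sign follows the sign of a tweet sum and the
# magnitude depends on the length of the term. Every non-zero tweet sum
# updates all collected terms, so later tweets overwrite earlier values.
put_values = {}
for tweet_sum in val_exist.values():
    if tweet_sum > 0:
        for value in val_not_exist.values():
            if len(value) > 3:
                put_values[value] = 2
            if len(value) > 5:
                put_values[value] = 4
    if tweet_sum < 0:
        for value in val_not_exist.values():
            if len(value) >= 3:
                put_values[value] = -2
            if len(value) >= 5:
                put_values[value] = -3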

In [None]:
# Add the newly scored terms to the sentiment dictionary.
for key in put_values:
    scores[key] = put_values[key]
# Recompute the sum for each tweet with the extended dictionary.
for tweets in range(len(Tweet_list)):
    Sum_Word = 0
    for Word in Tweet_list[tweets]:
        if Word in scores:
            Sum_Word = Sum_Word + scores[Word]
            val_exist[tweets] = Sum_Word
# Print the result to stdout; term_sentiment.py is the script itself, not an output file.
print(val_exist)

{0: 10, 1: 12, 2: -2, 3: 2, 4: 6, 5: 4, 6: 6, 7: 5, 8: 10, 9: -6, 10: -5, 11: 8, 12: 4, 13: 9, 14: 6, 15: 9, 16: 12, 17: 8, 18: 4, 19: 7, 20: 9, 21: 5, 22: 3, 23: -8, 24: 2, 25: 7, 26: 2, 27: 1, 28: 6, 29: 2, 30: 1, 31: 8, 32: 9, 33: 8, 34: 9, 35: -3, 36: 8, 37: 10}
