### Finding your Python Path

Your current working directory, the location Python resolves relative file paths against, can be displayed using the built-in `os` module. The `os` module provides access to operating-system-dependent functionality from Python programs and scripts.

To find your current working directory, call `os.getcwd()`. The `os.listdir()` function lists all entries in a directory, which is a quick way to check that the CSV file you are loading is in the directory you expect.

```
# Find out your current working directory
import os
print(os.getcwd())

# Out: /Users/shane/Documents/blog

# Display all of the files found in your current working directory
print(os.listdir(os.getcwd()))


# Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']
```

Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using `os.chdir()`.
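
As a minimal sketch of that alternative, the snippet below switches into a directory, lists it, and then restores the original working directory. The temporary directory and the `test_data.csv` file inside it are stand-ins created only for this illustration.

```python
import os
import tempfile

original_dir = os.getcwd()

# Create a throwaway directory containing one CSV file (illustrative only)
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "test_data.csv"), "w") as f:
        f.write("name,age\nAnn,34\n")

    os.chdir(data_dir)            # relative paths now resolve inside data_dir
    files_seen = os.listdir(".")  # same result as os.listdir(os.getcwd())
    print(files_seen)

    os.chdir(original_dir)        # switch back before the directory is removed
```

Restoring the original directory afterwards keeps later relative paths in the same session behaving as before.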

### Advanced Read CSV Files

There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:

Specifying Data Types
As mentioned before, CSV files do not contain any type information for data. pandas infers each column's data type from the values it reads, which can lead to errors. To manually specify the data types for different columns, pass the `dtype` parameter a dictionary of column names and the data types to apply, for example:

`dtype={"name": str, "age": np.int32}`
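
To make the effect concrete, here is a small sketch that reads the same CSV text twice, once with inferred types and once with an explicit `dtype` mapping. The column names and values (including the `zip` column) are made up for this illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; the zip codes have leading zeros
csv_text = "name,age,zip\nAnn,34,02134\nBob,27,00501\n"

# Inferred types: zip is read as an integer, so the leading zeros are lost
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit types: zip stays a string and age is stored as a 32-bit integer
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32, "zip": str},
)

print(inferred["zip"].tolist())
print(typed["zip"].tolist())
print(typed["age"].dtype)
```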

Note that for dates and datetimes, the format, columns, and other behaviour can be adjusted using the `parse_dates`, `date_parser`, `dayfirst`, and `keep_date_col` parameters.

There is a parse_dates parameter for read_csv which allows you to define the names of the columns you want treated as dates or datetimes:
```
date_cols = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=date_cols)
```
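
The `dayfirst` parameter matters when a date string is ambiguous. The sketch below, using made-up data, parses the same value `03/04/2021` with and without `dayfirst=True`; the column names are assumptions for this example only.

```python
import io

import pandas as pd

# Hypothetical one-row file with an ambiguous date
csv_text = "event,when\nlaunch,03/04/2021\n"

# Default: month first, so this is read as 4 March 2021
month_first = pd.read_csv(io.StringIO(csv_text), parse_dates=["when"])

# dayfirst=True: day first, so this is read as 3 April 2021
day_first = pd.read_csv(io.StringIO(csv_text), parse_dates=["when"], dayfirst=True)

print(month_first["when"].iloc[0])
print(day_first["when"].iloc[0])
```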

In [27]:
import pandas as pd

pdf2 = pd.read_csv('output3.csv')
pdf2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6510 entries, 0 to 6509
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   text             6510 non-null   object
 1   favourite_count  6510 non-null   int64 
 2   retweet_count    6510 non-null   int64 
 3   created_at       6510 non-null   object
dtypes: int64(2), object(2)
memory usage: 203.6+ KB


My workaround was to load the column as its default type, then convert it with the `pandas.to_datetime()` function on the next line.

`df[target_col] = pd.to_datetime(df[target_col])`

In [128]:
pdf2['created_at'] = pd.to_datetime(pdf2['created_at'])
pdf2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6510 entries, 0 to 6509
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   text             6510 non-null   object        
 1   favourite_count  6510 non-null   int64         
 2   retweet_count    6510 non-null   int64         
 3   created_at       6510 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 203.6+ KB


In [118]:
import pandas as pd

pdf2 = pd.read_csv('output3.csv', parse_dates = ['created_at',])
pdf2.head(5)

Unnamed: 0,text,favourite_count,retweet_count,created_at
0,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20 23:59:37
1,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20 23:59:33
2,RT @TimWattsMP: Can you believe this?\n\nThe M...,0,212,2021-06-20 23:58:46
3,RT @DrEricDing: 5) One dose of the vaccine is ...,0,270,2021-06-20 23:57:54
4,RT @DrEricDing: 5) One dose of the vaccine is ...,0,270,2021-06-20 23:56:29


In [119]:
pdf2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6510 entries, 0 to 6509
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   text             6510 non-null   object        
 1   favourite_count  6510 non-null   int64         
 2   retweet_count    6510 non-null   int64         
 3   created_at       6510 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 203.6+ KB


# Get Tweets

This script extracts all the tweets with the hashtag #covid-19 posted on the day before today (yesterday) and saves them to a .csv file.
We use the `tweepy` library, which can be installed with the command `pip install tweepy`.

First, we import the configuration file, called `config.py`, which is located in the same directory as this script.
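
The contents of `config.py` are not shown in this notebook. As a hedged sketch, one way to write it is to read the four credentials used in the authentication cell from environment variables, so that no secrets are stored in the file itself. The environment-variable names are an assumption, chosen here to match the constant names.

```python
# config.py (hypothetical layout; the original file is not shown)
import os

# Missing variables fall back to an empty string so the import itself never fails.
TWITTER_CONSUMER_KEY = os.environ.get("TWITTER_CONSUMER_KEY", "")
TWITTER_CONSUMER_SECRET = os.environ.get("TWITTER_CONSUMER_SECRET", "")
TWITTER_ACCESS_TOKEN = os.environ.get("TWITTER_ACCESS_TOKEN", "")
TWITTER_ACCESS_TOKEN_SECRET = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET", "")
```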

In [1]:
import os
import sys
jv = os.environ.get('JAVA_HOME', None)

import findspark
findspark.init()

os.environ['PYSPARK_SUBMIT_ARGS'] = \
'--packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.0 pyspark-shell'
# '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

sys.path


['/opt/spark/python',
 '/opt/spark/python/lib/py4j-0.10.9-src.zip',
 '/home/hadoopuser/Documents/twitter_search_api',
 '/home/hadoopuser/.vscode/extensions/ms-toolsai.jupyter-2021.6.999662501/pythonFiles/vscode_datascience_helpers',
 '/home/hadoopuser/.vscode/extensions/ms-toolsai.jupyter-2021.6.999662501/pythonFiles',
 '/home/hadoopuser/.vscode/extensions/ms-toolsai.jupyter-2021.6.999662501/pythonFiles/lib/python',
 '/opt/anaconda/envs/pyspark_env/lib/python37.zip',
 '/opt/anaconda/envs/pyspark_env/lib/python3.7',
 '/opt/anaconda/envs/pyspark_env/lib/python3.7/lib-dynload',
 '',
 '/opt/anaconda/envs/pyspark_env/lib/python3.7/site-packages',
 '/opt/anaconda/envs/pyspark_env/lib/python3.7/site-packages/IPython/extensions',
 '/home/hadoopuser/.ipython']

In [2]:
import sys, glob, os
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))

In [4]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


spark=SparkSession.builder \
.appName("Spark_and_Pandas_twitter_dfs") \
.master("local[*]") \
.getOrCreate()

In [5]:
spark

In [6]:
from config import *
import tweepy
import datetime

import sys
import logging

logger = logging.getLogger('tweets_search')

In [7]:
import pandas as pd
# import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [8]:
print(f"logger.root.level = {logger.root.level}, logger.root.name = {logger.root.name}")
print(f"logger.name = {logger.name}")

logger.root.level = 30, logger.root.name = root
logger.name = tweets_search


In [9]:
format = "%(asctime)s - %(levelname)s - %(message)s"
logging.basicConfig(format=format, stream=sys.stdout, level = logging.DEBUG)
# logging.basicConfig(format=format, stream=sys.stdout, level = logging.INFO)

In [10]:
print(logger.root.level)

10


In [17]:
# logger.root.level = 10

In [18]:
# print(logger.root.level)

10


We set up the connection to our Twitter App by using the `OAuthHandler` class and its `set_access_token()` method. Then we call the Twitter API through the `API` class.

In [11]:
auth = tweepy.OAuthHandler(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify = True)

In [11]:
api.me()

2021-06-24 23:14:44,448 - DEBUG - PARAMS: {}
2021-06-24 23:14:44,450 - DEBUG - Signing request <PreparedRequest [GET]> using client <Client client_key=pw0ihLFxH3nwDrd4HBd7pqUrc, client_secret=****, resource_owner_key=1360011857969479682-iLrxBUlqdtExwkqiN9iZsHYDXIFTZz, resource_owner_secret=****, signature_method=HMAC-SHA1, signature_type=AUTH_HEADER, callback_uri=None, rsa_key=None, verifier=None, realm=None, encoding=utf-8, decoding=None, nonce=None, timestamp=None>
2021-06-24 23:14:44,451 - DEBUG - Including body in call to sign: False
2021-06-24 23:14:44,455 - DEBUG - Collected params: [('oauth_nonce', '111288902889639547901624565684'), ('oauth_timestamp', '1624565684'), ('oauth_version', '1.0'), ('oauth_signature_method', 'HMAC-SHA1'), ('oauth_consumer_key', 'pw0ihLFxH3nwDrd4HBd7pqUrc'), ('oauth_token', '1360011857969479682-iLrxBUlqdtExwkqiN9iZsHYDXIFTZz')]
2021-06-24 23:14:44,458 - DEBUG - Normalized params: oauth_consumer_key=pw0ihLFxH3nwDrd4HBd7pqUrc&oauth_nonce=1112889028896395

User(_api=<tweepy.api.API object at 0x7f6253980850>, _json={'id': 1360011857969479682, 'id_str': '1360011857969479682', 'name': 'vasiange', 'screen_name': 'vasiange', 'location': '', 'profile_location': None, 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 0, 'friends_count': 15, 'listed_count': 0, 'created_at': 'Thu Feb 11 23:45:06 +0000 2021', 'favourites_count': 0, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 0, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png', 'profile_image_url_https': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.p

In [12]:
api.rate_limit_status()

2021-06-24 23:14:55,395 - DEBUG - PARAMS: {}
2021-06-24 23:14:55,396 - DEBUG - Signing request <PreparedRequest [GET]> using client <Client client_key=pw0ihLFxH3nwDrd4HBd7pqUrc, client_secret=****, resource_owner_key=1360011857969479682-iLrxBUlqdtExwkqiN9iZsHYDXIFTZz, resource_owner_secret=****, signature_method=HMAC-SHA1, signature_type=AUTH_HEADER, callback_uri=None, rsa_key=None, verifier=None, realm=None, encoding=utf-8, decoding=None, nonce=None, timestamp=None>
2021-06-24 23:14:55,398 - DEBUG - Including body in call to sign: False
2021-06-24 23:14:55,399 - DEBUG - Collected params: [('oauth_nonce', '126286861049097242631624565695'), ('oauth_timestamp', '1624565695'), ('oauth_version', '1.0'), ('oauth_signature_method', 'HMAC-SHA1'), ('oauth_consumer_key', 'pw0ihLFxH3nwDrd4HBd7pqUrc'), ('oauth_token', '1360011857969479682-iLrxBUlqdtExwkqiN9iZsHYDXIFTZz')]
2021-06-24 23:14:55,400 - DEBUG - Normalized params: oauth_consumer_key=pw0ihLFxH3nwDrd4HBd7pqUrc&oauth_nonce=1262868610490972

reset': 1624566596}},
  'guide': {'/guide': {'limit': 180, 'remaining': 180, 'reset': 1624566596},
   '/guide/get_explore_locations': {'limit': 100,
    'remaining': 100,
    'reset': 1624566596},
   '/guide/explore_locations_with_autocomplete': {'limit': 200,
    'remaining': 200,
    'reset': 1624566596}},
  'auth': {'/auth/csrf_token': {'limit': 15,
    'remaining': 15,
    'reset': 1624566596}},
  'blocks': {'/blocks/list': {'limit': 15,
    'remaining': 15,
    'reset': 1624566596},
   '/blocks/ids': {'limit': 15, 'remaining': 15, 'reset': 1624566596}},
  'geo': {'/geo/similar_places': {'limit': 15,
    'remaining': 15,
    'reset': 1624566596},
   '/geo/place_page': {'limit': 75, 'remaining': 75, 'reset': 1624566596},
   '/geo/id/:place_id': {'limit': 75, 'remaining': 75, 'reset': 1624566596},
   '/geo/reverse_geocode': {'limit': 15, 'remaining': 15, 'reset': 1624566596},
   '/geo/search': {'limit': 15, 'remaining': 15, 'reset': 1624566596}},
  'users': {'/users/': {'limit': 900,

Now we set up the dates. We need the end of the search window, `until` (today), and its start, `since` (seven days before today in the cell below).

In [24]:
today = datetime.date.today()
since= today - datetime.timedelta(days=7)
until= today
until, since
# (datetime.date(2021, 6, 7), datetime.date(2021, 6, 6))

(datetime.date(2021, 6, 25), datetime.date(2021, 6, 18))

In [67]:
logger.debug(f"full_text: '{until, since}'")

2021-06-25 01:50:41,724 - DEBUG - full_text: '(datetime.date(2021, 6, 25), datetime.date(2021, 6, 18))'


We search for tweets on Twitter by using the `Cursor` class.
We pass `api.search` to the cursor as the method to call, along with the query string, which is specified through the `q` parameter.
The query string can include operators such as the following (all optional):

* `from:` - to specify a specific Twitter user profile
* `since:` - to specify the start date of the search
* `until:` - to specify the end date of the search

The cursor can also receive other parameters, such as the language and the `tweet_mode`. If `tweet_mode='extended'`, the full text of the tweet is returned; otherwise only the first 140 characters are returned.
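
The query operators above can be assembled with plain string formatting. The helper below is a minimal sketch: `build_query` is a hypothetical name, not part of `tweepy`, and it only builds the string that would be passed as `q`.

```python
import datetime


def build_query(terms, since, until, from_user=None):
    """Join search terms with the optional from:/since:/until: operators."""
    parts = [terms]
    if from_user is not None:
        parts.append(f"from:{from_user}")
    parts.append(f"since:{since.isoformat()}")  # dates are formatted as YYYY-MM-DD
    parts.append(f"until:{until.isoformat()}")
    return " ".join(parts)


until_demo = datetime.date(2021, 6, 25)
since_demo = until_demo - datetime.timedelta(days=7)

query = build_query("#Covid-19", since_demo, until_demo)
print(query)
```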

In [None]:
# # example
# tweets = tweepy.Cursor(api.search, q="#Covid-19", tweet_mode='extended').items()
# for tweet in tweets:
#     content = tweet.full_text

In [None]:
# tweets_list = tweepy.Cursor(api.search, q="#Covid-19 since:" + str(yesterday)+ " until:" + str(today),tweet_mode='extended', lang='en').items()

In [None]:
# tweets_list = tweepy.Cursor(api.search, q=f"#Covid-19 since:{str(yesterday)} until:{str(today)}",tweet_mode='extended', lang='en').items()

In [29]:
# tweets_list = tweepy.Cursor(api.search, q=['astrazeneca', 'pfizer'],since= str(since), until=str(until),tweet_mode='extended', lang='en').items()

# Greek Language = el
# tweets_list = tweepy.Cursor(api.search, q=['coffee island'],since= str(since), until=str(until),tweet_mode='extended', lang='el').items()

# English Language = en
# Pass the query as a single string; a list is sent with its brackets and quotes included
tweets_list = tweepy.Cursor(api.search, q='coffee island OR CoffeeIsland', since=str(since), until=str(until), tweet_mode='extended', lang='en').items()

Now we loop across the `tweets_list`, and, for each tweet, we extract the text, the creation date, the number of retweets and the favourite count. We store every tweet into a list, called `output`.
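
The per-tweet extraction can be tried without calling the Twitter API. The sketch below uses stand-in objects that mimic only the attributes read in the loop; `tweet_to_row`, `fake_tweets`, and `output_demo` are hypothetical names introduced for this illustration.

```python
import datetime
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Flatten one tweet object into the dict layout stored in `output`."""
    return {
        "text": tweet._json["full_text"],
        "favourite_count": tweet.favorite_count,
        "retweet_count": tweet.retweet_count,
        "created_at": tweet.created_at,
    }


# Stand-ins for tweepy status objects (illustrative values only)
fake_tweets = [
    SimpleNamespace(
        _json={"full_text": "first example tweet"},
        favorite_count=14,
        retweet_count=4,
        created_at=datetime.datetime(2021, 6, 19, 2, 55, 57),
    ),
    SimpleNamespace(
        _json={"full_text": "second example tweet"},
        favorite_count=0,
        retweet_count=39,
        created_at=datetime.datetime(2021, 6, 19, 2, 37, 42),
    ),
]

output_demo = [tweet_to_row(tweet) for tweet in fake_tweets]
print(len(output_demo))
```

Keeping the extraction in a small function makes it easy to test the row layout separately from the network calls.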

In [26]:
import time
seconds = 5
start = time.time()
time.sleep(seconds)
end = time.time()
logger.info(f"elapsed_time: '{end - start}'")

2021-06-25 03:00:56,957 - INFO - elapsed_time: '5.004919528961182'


---
# TEST

---

In [17]:
# tweets_list2 = tweepy.Cursor(api.search, q=['pfizer','astrazeneca'],since= str(since), until=str(until),tweet_mode='extended', lang='en').items(2)

import time
start = time.time()
output = []
for tweet in tweets_list:
    # text = tweet._json["full_text"]
    #print(text) 
    # https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets           
    # "geo": null,"coordinates": null,"place": null,"contributors": null,
    # "is_quote_status": false,"retweet_count": 988,"favorite_count": 3875,
    # "favorited": false,"retweeted": false,"possibly_sensitive": false,"lang": "en"
    # https://developer.twitter.com/en/docs/twitter-ids
    logger.info(f"tweet_id_str: {'-'*30}")
    logger.info(f"created_at: {tweet.created_at}")
    logger.info(f"full_text: {tweet._json['full_text']}")
    logger.info(f"tweet_id: {tweet.id}")
    logger.info(f"tweet_id_str: {tweet.id_str}")
    logger.info(f"user: {tweet._json['user']['name']}")
    logger.info(f"user_id: {tweet._json['user']['id']}")    
    # favourite_count = tweet.favorite_count
    # retweet_count = tweet.retweet_count
    # created_at = tweet.created_at
    
#     line = {'text' : text, 'favourite_count' : favourite_count, 'retweet_count' : retweet_count, 'created_at' : created_at}
#     output.append(line)
#     logger.info(f"Append list length : { len(output)}")
# end = time.time()
# logger.info(f"elapsed_time: '{end - start}'")

2021-06-24 23:18:06,420 - DEBUG - PARAMS: {'q': b"['coffee island']", 'since': b'2021-06-22', 'until': b'2021-06-24', 'tweet_mode': b'extended', 'lang': b'gr'}
2021-06-24 23:18:06,428 - DEBUG - Signing request <PreparedRequest [GET]> using client <Client client_key=pw0ihLFxH3nwDrd4HBd7pqUrc, client_secret=****, resource_owner_key=1360011857969479682-iLrxBUlqdtExwkqiN9iZsHYDXIFTZz, resource_owner_secret=****, signature_method=HMAC-SHA1, signature_type=AUTH_HEADER, callback_uri=None, rsa_key=None, verifier=None, realm=None, encoding=utf-8, decoding=None, nonce=None, timestamp=None>
2021-06-24 23:18:06,429 - DEBUG - Including body in call to sign: False
2021-06-24 23:18:06,442 - DEBUG - Collected params: [('q', "['coffee island']"), ('since', '2021-06-22'), ('until', '2021-06-24'), ('tweet_mode', 'extended'), ('lang', 'gr'), ('oauth_nonce', '89097120203846973971624565886'), ('oauth_timestamp', '1624565886'), ('oauth_version', '1.0'), ('oauth_signature_method', 'HMAC-SHA1'), ('oauth_consum

---

In [30]:
import time
start = time.time()
output = []
for tweet in tweets_list:
    text = tweet._json["full_text"]
    #print(text) 
    # https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets           
    # "geo": null,"coordinates": null,"place": null,"contributors": null,
    # "is_quote_status": false,"retweet_count": 988,"favorite_count": 3875,
    # "favorited": false,"retweeted": false,"possibly_sensitive": false,"lang": "en"
    logger.debug(f"full_text: '{text}'")
    favourite_count = tweet.favorite_count
    retweet_count = tweet.retweet_count
    created_at = tweet.created_at
    
    line = {'text' : text, 'favourite_count' : favourite_count, 'retweet_count' : retweet_count, 'created_at' : created_at}
    output.append(line)
    logger.info(f"Append list length : { len(output)}")
end = time.time()
logger.info(f"elapsed_time: '{end - start}'")

 full_text: '@jessicahodlr DM use to reserve Satoshi's Selection coffee. its Bitcoin coffee from the enchanted island of Puerto Rico 🇵🇷☕️ Available in 2 weeks for shipping'
2021-06-25 03:18:30,485 - INFO - Append list length : 604
2021-06-25 03:18:30,486 - DEBUG - full_text: 'Colourful handmade chocolates being boxed up from Arlene's Island Artisan Chocolate 🍫 😋

Available here  ➡️ https://t.co/yT7USQMTcx https://t.co/BzsOiRjgoh'
2021-06-25 03:18:30,486 - INFO - Append list length : 605
2021-06-25 03:18:30,487 - DEBUG - full_text: '90% of socializing in Rhode Island involves iced coffee'
2021-06-25 03:18:30,488 - INFO - Append list length : 606
2021-06-25 03:18:30,489 - DEBUG - full_text: 'ここー
[Friends+]Treasure Hunt Island #VRCinHere 21:37:10'
2021-06-25 03:18:30,490 - INFO - Append list length : 607
2021-06-25 03:18:30,490 - DEBUG - full_text: '@island_iverson This tweet made me spill hot coffee on my d region. Good stuff.'
2021-06-25 03:18:30,491 - INFO - Append list length : 608
20

In [31]:
output

,
 {'text': 'RT @mir_gifs: June 19th bunny and animal crossing island coffee status ☕️🐰🐶 https://t.co/cHdGkmEMN1',
  'favourite_count': 0,
  'retweet_count': 4,
  'created_at': datetime.datetime(2021, 6, 19, 3, 4, 41)},
 {'text': 'June 19th bunny and animal crossing island coffee status ☕️🐰🐶 https://t.co/cHdGkmEMN1',
  'favourite_count': 14,
  'retweet_count': 4,
  'created_at': datetime.datetime(2021, 6, 19, 2, 55, 57)},
 {'text': 'RT @cathsherman: Coral and Limestone Beach, Great Abaco Island, The Bahamas #Bahamas #FineArtAmerica #CLS \n#ArtForSale\nhttps://t.co/5Uk1lc1…',
  'favourite_count': 0,
  'retweet_count': 39,
  'created_at': datetime.datetime(2021, 6, 19, 2, 37, 42)},
 {'text': "@BjornIronsights At one point in time I had one overnight a month (developmentally disabled kiddo.)\nI drove 2 or 3 hours to a scruffy little seaside motel in Long Island. I'd bring a journal and my guitar. No TV. Fried clams. Coffee. Pancakes. Room 66 every time.",
  'favourite_count': 0,
  'retwee

In [32]:
len(output)

644

---
### create pdf from list

Finally, we convert the `output` list to a `pandas DataFrame` and we store results.

In [33]:
pdf_cof_island = pd.DataFrame(output)


In [34]:
pdf_cof_island.shape

(644, 4)

In [35]:
pdf_cof_island.head(2)

Unnamed: 0,text,favourite_count,retweet_count,created_at
0,Stop by the @nespressousa store after my docto...,0,0,2021-06-24 23:53:05
1,RT @witch_mote: Legend has is that if you fini...,0,143,2021-06-24 23:48:56


In [40]:
pdf_cof_island.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 644 entries, 0 to 643
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   text             644 non-null    object        
 1   favourite_count  644 non-null    int64         
 2   retweet_count    644 non-null    int64         
 3   created_at       644 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 20.2+ KB


Selecting specific columns in a pandas dataframe

In [37]:
pdf_cof_island[['text', 'created_at']]

Unnamed: 0,text,created_at
0,Stop by the @nespressousa store after my docto...,2021-06-24 23:53:05
1,RT @witch_mote: Legend has is that if you fini...,2021-06-24 23:48:56
2,RT @SafemoonWarrior: #SAFEMOON🕵🏻‍♂️ \nCan some...,2021-06-24 23:45:59
3,@orangepillpod @maxkeiser @stacyherbert DM us ...,2021-06-24 23:34:34
4,RT @SafemoonWarrior: #SAFEMOON🕵🏻‍♂️ \nCan some...,2021-06-24 23:24:31
...,...,...
639,@ARlESVENUSX the coffee ice cream slander here...,2021-06-18 01:58:12
640,"""She also loves to travel around the world and...",2021-06-18 01:40:34
641,"RT @lizwadsworth65: @quinncy Exile and Duke, a...",2021-06-18 01:01:30
642,"@quinncy Exile and Duke, a used bookstore/cafe...",2021-06-18 00:59:47


In [39]:
pdf_cof_island[['text', 'created_at']].groupby('created_at').first()

Unnamed: 0_level_0,text
created_at,Unnamed: 1_level_1
2021-06-18 00:17:45,@bananzers I wanna visit the big island and dr...
2021-06-18 00:59:47,"@quinncy Exile and Duke, a used bookstore/cafe..."
2021-06-18 01:01:30,"RT @lizwadsworth65: @quinncy Exile and Duke, a..."
2021-06-18 01:40:34,"""She also loves to travel around the world and..."
2021-06-18 01:58:12,@ARlESVENUSX the coffee ice cream slander here...
...,...
2021-06-24 23:24:31,RT @SafemoonWarrior: #SAFEMOON🕵🏻‍♂️ \nCan some...
2021-06-24 23:34:34,@orangepillpod @maxkeiser @stacyherbert DM us ...
2021-06-24 23:45:59,RT @SafemoonWarrior: #SAFEMOON🕵🏻‍♂️ \nCan some...
2021-06-24 23:48:56,RT @witch_mote: Legend has is that if you fini...


---
### save and read pdf without header

In [28]:
pdf_cof_island.to_csv('output_cof_island.csv', mode='a', header = False, index = False )
#df.to_csv('output.csv')

In [29]:
pdf_cof_island = pd.read_csv('output3.csv', names=['text', 'favourite_count', 'retweet_count', 'created_at'], parse_dates=['created_at'])
pdf_cof_island.head(5)

Unnamed: 0,text,favourite_count,retweet_count,created_at
0,text,favourite_count,retweet_count,created_at
1,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20 23:59:37
2,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20 23:59:33
3,RT @TimWattsMP: Can you believe this?\n\nThe M...,0,212,2021-06-20 23:58:46
4,RT @DrEricDing: 5) One dose of the vaccine is ...,0,270,2021-06-20 23:57:54
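
Row 0 in the output above is the file's own header line read as data, which happens when `names=` is supplied for a file that already has a header. A related problem arises when appending with `mode='a'` and `header=True` more than once. As a hedged sketch of one way to avoid both, the helper below writes the header only when the file does not exist yet; `append_rows` and the file name are hypothetical.

```python
import os
import tempfile

import pandas as pd


def append_rows(df, path):
    """Append df to a CSV file, writing the header only on first creation."""
    df.to_csv(path, mode="a", header=not os.path.exists(path), index=False)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tweets.csv")
    batch = pd.DataFrame({"text": ["a", "b"], "retweet_count": [1, 2]})

    append_rows(batch, path)  # creates the file and writes the header
    append_rows(batch, path)  # appends the rows without a second header

    combined = pd.read_csv(path)  # no names= needed; the header is read once

print(combined.shape)
```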


In [30]:
pdf = df2

# shape is an attribute, not a method: calling pdf.shape() raises
# TypeError: 'tuple' object is not callable
pdf.shape

In [31]:
pdf_cof_island.shape

(13089, 4)

---
### save a pdf to a csv file with header

In [46]:
#df = pd.DataFrame(output)
pdf_cof_island.to_csv('output_cof_island.csv', mode='a', header=True, index = False)

---
# Create a pdf and a sdf from a csv file with header

---
### pdf

In [39]:
pdf_cof_island2 = pd.read_csv('output_cof_island.csv', parse_dates=['created_at',])
pdf_cof_island2.head(5)

Unnamed: 0,text,favourite_count,retweet_count,created_at
0,RT @NikosPachilas: Mikel και Coffee Island αντ...,0,2,2021-06-24 12:26:01
1,RT @NikosPachilas: Mikel και Coffee Island αντ...,0,2,2021-06-24 12:20:35
2,RT @RealTime_eu: Mikel και Coffee Island ανταγ...,0,2,2021-06-24 12:20:27
3,RT @RealTime_eu: Mikel και Coffee Island ανταγ...,0,2,2021-06-24 12:19:07
4,Mikel και Coffee Island ανταγωνίζονται στην αγ...,2,2,2021-06-24 12:15:01


In [23]:
pdf_cof_island2.shape

(14, 4)

In [25]:
pdf_cof_island2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   text             14 non-null     object        
 1   favourite_count  14 non-null     int64         
 2   retweet_count    14 non-null     int64         
 3   created_at       14 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 576.0+ bytes


#### Convert a pdf datetime column to date

In [36]:
pdf_cof_island2['created_at'] = pdf_cof_island2['created_at'].dt.date

In [37]:
pdf_cof_island2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   text             14 non-null     object
 1   favourite_count  14 non-null     int64 
 2   retweet_count    14 non-null     int64 
 3   created_at       14 non-null     object
dtypes: int64(2), object(2)
memory usage: 576.0+ bytes


In [38]:
pdf_cof_island2['created_at'][0]

datetime.date(2021, 6, 24)
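
As the `info()` output shows, `.dt.date` turns the column into plain Python `date` objects with dtype `object`. If the column should stay a datetime type, one alternative (a sketch, not what this notebook does) is `.dt.normalize()`, which sets the time part to midnight. The variable names below are made up for the example.

```python
import datetime

import pandas as pd

stamps = pd.Series(pd.to_datetime(["2021-06-24 12:26:01", "2021-06-24 12:20:35"]))

as_objects = stamps.dt.date          # Python date objects, dtype becomes object
as_midnight = stamps.dt.normalize()  # still a datetime column, time set to 00:00:00

print(as_objects.dtype)
print(as_midnight.dtype)
print(as_objects.iloc[0] == datetime.date(2021, 6, 24))
```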

In [29]:
# pdf_cof_island2.to_csv('output_cof_island.csv', date_format='%Y-%m-%d')
# 
# pdf_cof_island2.head(5)

In [111]:
pdf.describe()

Unnamed: 0,favourite_count,retweet_count,negative_nltk,positive_nltk,neutral_nltk,compound_nltk
count,6510.0,6510.0,6510.0,6510.0,6510.0,6510.0
mean,1.997081,903.088172,0.047635,0.038732,0.913632,-0.021985
std,54.727334,1714.038097,0.083419,0.070628,0.106266,0.358451
min,0.0,0.0,0.0,0.0,0.474,-0.9578
25%,0.0,8.0,0.0,0.0,0.847,-0.0772
50%,0.0,200.0,0.0,0.0,0.9605,0.0
75%,0.0,1307.0,0.096,0.073,1.0,0.0
max,3482.0,19068.0,0.375,0.526,1.0,0.9402


---
## sdf

In [18]:
# spark.read.csv(
#     "some_input_file.csv", 
#     header=True, 
#     mode="DROPMALFORMED", 
#     schema=schema
# )

# or

# (
#     spark.read
#     .schema(schema)
#     .option("header", "true")
#     .option("mode", "DROPMALFORMED")
#     .csv("some_input_file.csv")
# )

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType, DateType

schema = StructType([
    StructField("text", StringType()),
    StructField("favourite_count", IntegerType()),
    StructField("retweet_count", IntegerType()),
    StructField("created_at", DateType())
])
# https://datascience.stackexchange.com/questions/12727/reading-csvs-with-new-lines-in-fields-with-spark
sdf = spark.read.csv('output3.csv',schema=schema,header=True, escape = '"', multiLine=True)

In [19]:
type(sdf)
# pyspark.sql.dataframe.DataFrame
sdf.printSchema()

root
 |-- text: string (nullable = true)
 |-- favourite_count: integer (nullable = true)
 |-- retweet_count: integer (nullable = true)
 |-- created_at: date (nullable = true)



In [20]:
sdf.count() 

6510

In [21]:
sdf.columns

['text', 'favourite_count', 'retweet_count', 'created_at']

In [22]:
len(sdf.columns)

4

In [None]:
# Add this to your code:

import pyspark
def spark_shape(self):
    return (self.count(), len(self.columns))
pyspark.sql.dataframe.DataFrame.shape = spark_shape
# Then you can do

# >>> df.shape()
# (10000, 10)

In [23]:
sdf.toPandas().head(5)

Unnamed: 0,text,favourite_count,retweet_count,created_at
0,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20
1,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20
2,RT @TimWattsMP: Can you believe this?\n\nThe M...,0,212,2021-06-20
3,RT @DrEricDing: 5) One dose of the vaccine is ...,0,270,2021-06-20
4,RT @DrEricDing: 5) One dose of the vaccine is ...,0,270,2021-06-20


In [29]:
s = sdf.toPandas()

s.shape

(6510, 4)

---
### def sentiment_scores & sentiment_scoresUDF

In [30]:
import sys
from pyspark.sql.functions import udf
# DoubleType, FloatType, ByteType, IntegerType, LongType, ShortType, ArrayType,StructField, StructType, Row
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pyspark.sql.types as Types
from nltk.sentiment.vader import SentimentIntensityAnalyzer



def sentiment_scores(sentence: str) -> dict:
    # Create a SentimentIntensityAnalyzer object.
    sid = SentimentIntensityAnalyzer()
    # The polarity_scores method of the SentimentIntensityAnalyzer
    # object returns a sentiment dictionary
    # that contains pos, neg, neu, and compound scores.
    r = sid.polarity_scores(sentence)
    return r

# You can optionally set the return type of your UDF. The default return type is StringType.
# udffactorial_p = udf(factorial_p, LongType())
sentiment_scoresUDF = udf(sentiment_scores, Types.MapType(Types.StringType(), Types.DoubleType()))

---
### pdf

#### create a new column with sentiment_scores

In [56]:
#df3['rating'] = df3['text'].apply(sid.polarity_scores)

pdf['rating'] = pdf['text'].apply(sentiment_scores)

In [57]:
pdf.head(2)

Unnamed: 0,text,favourite_count,retweet_count,created_at,rating
0,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20 23:59:37,"{'neg': 0.049, 'neu': 0.752, 'pos': 0.199, 'co..."
1,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20 23:59:33,"{'neg': 0.0, 'neu': 0.603, 'pos': 0.397, 'comp..."


In [33]:
pdf.tail(2)

Unnamed: 0,text,favourite_count,retweet_count,created_at,rating
6508,"RT @JohnRHewson: For the record, how many of o...",0,890,2021-06-19 00:00:27,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
6509,RT @DrEricDing: 3) “After years of reading res...,0,55,2021-06-19 00:00:08,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."


In [58]:
pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6510 entries, 0 to 6509
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   text             6510 non-null   object
 1   favourite_count  6510 non-null   int64 
 2   retweet_count    6510 non-null   int64 
 3   created_at       6510 non-null   object
 4   rating           6510 non-null   object
dtypes: int64(2), object(3)
memory usage: 254.4+ KB
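
Other outputs in this notebook contain separate `negative_nltk`, `positive_nltk`, `neutral_nltk`, and `compound_nltk` columns, but the cell that created them is not shown. One plausible way to derive them from the `rating` dictionaries is sketched below. It assumes each dictionary has the keys `neg`, `neu`, `pos`, and `compound`, and uses a small hand-written frame (`pdf_demo`) instead of real VADER output.

```python
import pandas as pd

# Hand-written stand-in for the DataFrame with a `rating` column
pdf_demo = pd.DataFrame({
    "text": ["first tweet", "second tweet"],
    "rating": [
        {"neg": 0.049, "neu": 0.752, "pos": 0.199, "compound": 0.7793},
        {"neg": 0.0, "neu": 0.603, "pos": 0.397, "compound": 0.7639},
    ],
})

# Expand each dict into its own columns, then rename to the *_nltk names
scores = pd.DataFrame(pdf_demo["rating"].tolist(), index=pdf_demo.index)
scores = scores.rename(columns={
    "neg": "negative_nltk",
    "pos": "positive_nltk",
    "neu": "neutral_nltk",
    "compound": "compound_nltk",
})

pdf_demo = pdf_demo.join(scores)
print(pdf_demo.columns.tolist())
```

Numeric columns like these are what make summaries such as `describe()` possible, since the `rating` column itself has dtype `object`.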


---
### sdf

create a new column with sentiment_scores

In [40]:
from pyspark.sql.functions import col,sqrt,log,reverse
sdf = sdf.withColumn("rating", sentiment_scoresUDF(sdf.text))
# t.show()
sdf.toPandas().style.set_properties(subset=['text'], **{'width': '300px'})

NameError: name 'sdf' is not defined

In [60]:
sdf.toPandas().head(2).style.set_properties(subset=['text'], **{'width': '300px'})

Unnamed: 0,text,favourite_count,retweet_count,created_at,negative_nltk,positive_nltk,neutral_nltk,compound_nltk,rating
0,"@breakfasttv I received the Astrazeneca vaccine on April 9th. No, I do not regret it! I look forward to getting the Pfizer vaccine on the 28th of this month. I feel confident we will be fully protected. Even if it is unconventional. 😊",0,0,2021-06-20,0.049,0.199,0.752,0.7793,"{'neg': 0.049, 'pos': 0.199, 'compound': 0.7793, 'neu': 0.752}"
1,#Pfizer #AstraZeneca #Moderna #JohnsonAndJohnson What's up ! Safe and effective ! ⬇️ ⬇️ https://t.co/fvw1H8cmXT,3,5,2021-06-20,0.0,0.397,0.603,0.7639,"{'neg': 0.0, 'pos': 0.397, 'compound': 0.7639, 'neu': 0.603}"


In [61]:
sdf.toPandas().tail(2).style.set_properties(subset=['text'], **{'width': '300px'})

Unnamed: 0,text,favourite_count,retweet_count,created_at,negative_nltk,positive_nltk,neutral_nltk,compound_nltk,rating
6508,"RT @JohnRHewson: For the record, how many of our political leaders had Pfizer rather than AstraZeneca?",0,890,2021-06-19,0.0,0.0,1.0,0.0,"{'neg': 0.0, 'pos': 0.0, 'compound': 0.0, 'neu': 1.0}"
6509,RT @DrEricDing: 3) “After years of reading research on mixing vaccine types -- known as heterologous prime-boosting -- Morgon concluded tha…,0,55,2021-06-19,0.0,0.0,1.0,0.0,"{'neg': 0.0, 'pos': 0.0, 'compound': 0.0, 'neu': 1.0}"


In [62]:
sdf.show(2, truncate = 30)

+------------------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+------------------------------+
|                          text|favourite_count|retweet_count|created_at|negative_nltk|positive_nltk|neutral_nltk|compound_nltk|                        rating|
+------------------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+------------------------------+
|@breakfasttv I received the...|              0|            0|2021-06-20|        0.049|        0.199|       0.752|       0.7793|[neg -> 0.049, pos -> 0.199...|
|#Pfizer
#AstraZeneca
#Moder...|              3|            5|2021-06-20|          0.0|        0.397|       0.603|       0.7639|[neg -> 0.0, pos -> 0.397, ...|
+------------------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+------------------------------+
only showing top 2 rows



In [49]:
# Returns the first num rows as a list of Row.
# This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
sdf.head(2)

[Row(text='@breakfasttv I received the Astrazeneca vaccine on April 9th. No, I do not regret it! I look forward to getting the Pfizer vaccine on the 28th of this month. I feel confident we will be fully protected. Even if it is unconventional. 😊', favourite_count=0, retweet_count=0, created_at=datetime.date(2021, 6, 20), rating={'neg': 0.049, 'pos': 0.199, 'compound': 0.7793, 'neu': 0.752}),
 Row(text="#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJohnson \n\nWhat's up !  Safe and effective !\n\n⬇️\n⬇️ https://t.co/fvw1H8cmXT", favourite_count=3, retweet_count=5, created_at=datetime.date(2021, 6, 20), rating={'neg': 0.0, 'pos': 0.397, 'compound': 0.7639, 'neu': 0.603})]

In [50]:
# Returns the last num rows as a list of Row.
sdf.tail(2)

[Row(text='RT @JohnRHewson: For the record, how many of our political leaders had Pfizer rather than AstraZeneca?', favourite_count=0, retweet_count=890, created_at=datetime.date(2021, 6, 19), rating={'neg': 0.0, 'pos': 0.0, 'compound': 0.0, 'neu': 1.0}),
 Row(text='RT @DrEricDing: 3) “After years of reading research on mixing vaccine types -- known as heterologous prime-boosting -- Morgon concluded tha…', favourite_count=0, retweet_count=55, created_at=datetime.date(2021, 6, 19), rating={'neg': 0.0, 'pos': 0.0, 'compound': 0.0, 'neu': 1.0})]

In [63]:
sdf.printSchema()

root
 |-- text: string (nullable = true)
 |-- favourite_count: integer (nullable = true)
 |-- retweet_count: integer (nullable = true)
 |-- created_at: date (nullable = true)
 |-- negative_nltk: double (nullable = true)
 |-- positive_nltk: double (nullable = true)
 |-- neutral_nltk: double (nullable = true)
 |-- compound_nltk: double (nullable = true)
 |-- rating: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = true)



In [64]:
from pyspark.sql.functions import col

# Pull each score out of the 'rating' map into its own column, then drop the map
sdf = (sdf
    .withColumn('negative_nltk', col('rating')['neg'])
    .withColumn('positive_nltk', col('rating')['pos'])
    .withColumn('neutral_nltk', col('rating')['neu'])
    .withColumn('compound_nltk', col('rating')['compound'])
    .drop('rating'))


In [65]:
sdf.toPandas().head(2).style.set_properties(subset=['text'], **{'width': '300px'})

Unnamed: 0,text,favourite_count,retweet_count,created_at,negative_nltk,positive_nltk,neutral_nltk,compound_nltk
0,"@breakfasttv I received the Astrazeneca vaccine on April 9th. No, I do not regret it! I look forward to getting the Pfizer vaccine on the 28th of this month. I feel confident we will be fully protected. Even if it is unconventional. 😊",0,0,2021-06-20,0.049,0.199,0.752,0.7793
1,#Pfizer #AstraZeneca #Moderna #JohnsonAndJohnson What's up ! Safe and effective ! ⬇️ ⬇️ https://t.co/fvw1H8cmXT,3,5,2021-06-20,0.0,0.397,0.603,0.7639


In [66]:
sdf.toPandas().tail(2).style.set_properties(subset=['text'], **{'width': '300px'})

Unnamed: 0,text,favourite_count,retweet_count,created_at,negative_nltk,positive_nltk,neutral_nltk,compound_nltk
6508,"RT @JohnRHewson: For the record, how many of our political leaders had Pfizer rather than AstraZeneca?",0,890,2021-06-19,0.0,0.0,1.0,0.0
6509,RT @DrEricDing: 3) “After years of reading research on mixing vaccine types -- known as heterologous prime-boosting -- Morgon concluded tha…,0,55,2021-06-19,0.0,0.0,1.0,0.0


In [228]:
sdf_from_list_of_rows = spark.createDataFrame(sdf.tail(2))

sdf_from_list_of_rows.show()

+--------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+
|                text|favourite_count|retweet_count|created_at|negative_nltk|positive_nltk|neutral_nltk|compound_nltk|
+--------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+
|RT @JohnRHewson: ...|              0|          890|2021-06-19|          0.0|          0.0|         1.0|          0.0|
|RT @DrEricDing: 3...|              0|           55|2021-06-19|          0.0|          0.0|         1.0|          0.0|
+--------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+



In [229]:
sdf_from_pdf = spark.createDataFrame(pdf)

# Note: this displays sdf_from_list_of_rows (built in the previous cell), not sdf_from_pdf
sdf_from_list_of_rows.show(2)

+--------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+
|                text|favourite_count|retweet_count|created_at|negative_nltk|positive_nltk|neutral_nltk|compound_nltk|
+--------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+
|RT @JohnRHewson: ...|              0|          890|2021-06-19|          0.0|          0.0|         1.0|          0.0|
|RT @DrEricDing: 3...|              0|           55|2021-06-19|          0.0|          0.0|         1.0|          0.0|
+--------------------+---------------+-------------+----------+-------------+-------------+------------+-------------+



https://stackoverflow.com/questions/61608057/output-vader-sentiment-scores-in-columns-based-on-dataframe-rows-of-tweets

---

In [67]:
pdf

Unnamed: 0,text,favourite_count,retweet_count,created_at,rating
0,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20 23:59:37,"{'neg': 0.049, 'neu': 0.752, 'pos': 0.199, 'co..."
1,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20 23:59:33,"{'neg': 0.0, 'neu': 0.603, 'pos': 0.397, 'comp..."
2,RT @TimWattsMP: Can you believe this?\n\nThe M...,0,212,2021-06-20 23:58:46,"{'neg': 0.124, 'neu': 0.876, 'pos': 0.0, 'comp..."
3,RT @DrEricDing: 5) One dose of the vaccine is ...,0,270,2021-06-20 23:57:54,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
4,RT @DrEricDing: 5) One dose of the vaccine is ...,0,270,2021-06-20 23:56:29,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
...,...,...,...,...,...
6505,@Fab485617452 @JohnRHewson I'm 73 and my wife ...,6,0,2021-06-19 00:02:33,"{'neg': 0.0, 'neu': 0.927, 'pos': 0.073, 'comp..."
6506,RT @kyle_minogue: @bjornradstrom Pfizer - 12-1...,0,2,2021-06-19 00:02:21,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
6507,"RT @JohnRHewson: For the record, how many of o...",0,890,2021-06-19 00:01:32,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
6508,"RT @JohnRHewson: For the record, how many of o...",0,890,2021-06-19 00:00:27,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."


In [68]:
pdf['negative_nltk']=[i['neg'] for i in pdf.rating]
pdf['positive_nltk']=[i['pos'] for i in pdf.rating]
pdf['neutral_nltk']=[i['neu'] for i in pdf.rating]
pdf['ncompound_nltk']=[i['compound'] for i in pdf.rating]

pdf.head(2)

Unnamed: 0,text,favourite_count,retweet_count,created_at,rating,negative_nltk,positive_nltk,neutral_nltk,ncompound_nltk
0,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20 23:59:37,"{'neg': 0.049, 'neu': 0.752, 'pos': 0.199, 'co...",0.049,0.199,0.752,0.7793
1,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20 23:59:33,"{'neg': 0.0, 'neu': 0.603, 'pos': 0.397, 'comp...",0.0,0.397,0.603,0.7639


In [69]:
pdf['negative_nltk'] = pdf['rating'].apply(lambda x : x['neg'])
pdf['positive_nltk'] = pdf['rating'].apply(lambda x : x['pos'])
pdf['neutral_nltk'] = pdf['rating'].apply(lambda x : x['neu'])
pdf['compound_nltk'] = pdf['rating'].apply(lambda x : x['compound'])

pdf = pdf.drop('rating', axis=1)
pdf.head(2)

Unnamed: 0,text,favourite_count,retweet_count,created_at,negative_nltk,positive_nltk,neutral_nltk,ncompound_nltk
0,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20 23:59:37,0.049,0.199,0.752,0.7793
1,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20 23:59:33,0.0,0.397,0.603,0.7639


In [13]:
# Create the dataframe
# df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

# Function to get rows at `rownums`
# def getrows(df, rownums=None):
    # return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])

# Get rows at positions 0 and 2.
# getrows(df, rownums=[0, 2]).collect()

In [14]:
pdf.info

NameError: name 'pdf' is not defined

Show the last row's date

In [70]:
pdf['created_at'][pdf.shape[0]-1]

'2021-06-19 00:00:08'

In [71]:
pdf['text'].count()

6510

In [72]:
pdf.count()

text               6510
favourite_count    6510
retweet_count      6510
created_at         6510
negative_nltk      6510
positive_nltk      6510
neutral_nltk       6510
ncompound_nltk     6510
dtype: int64

In [73]:
pdf.count(axis=0)

text               6510
favourite_count    6510
retweet_count      6510
created_at         6510
negative_nltk      6510
positive_nltk      6510
neutral_nltk       6510
ncompound_nltk     6510
dtype: int64

In [74]:
pdf.shape[0]

6510

In [75]:
len(pdf.index)

6510
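
The row counts above all agree because no column in `pdf` contains missing values. As a small sketch on a made-up DataFrame (`toy` and its contents are hypothetical, not the tweet data): `count()` only counts non-null cells, whereas `shape[0]` and `len(df.index)` count rows regardless of their content.

```python
import pandas as pd

# Hypothetical toy frame with one missing value in 'text'
toy = pd.DataFrame({"text": ["a", None, "c"], "retweet_count": [1, 2, 3]})

n_rows_shape = toy.shape[0]        # number of rows
n_rows_index = len(toy.index)      # number of rows
n_non_null = toy["text"].count()   # non-null values in this one column only

print(n_rows_shape, n_rows_index, n_non_null)
```

If the goal is the number of rows, `shape[0]` or `len(df.index)` is therefore the safer choice when a column may contain nulls.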

---

# Spark GroupBy and Aggregate Functions

`GroupBy` allows you to group rows together based on a column value. For example, you could group sales data by the day the sale occurred, or group repeat-customer data by the name of the customer.

Once you've performed the GroupBy operation, you can apply an aggregate function to the grouped data. An aggregate function combines multiple rows of data into a single output, such as the sum of the inputs or the number of inputs.
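
Conceptually, a group-by followed by an aggregate can be sketched in plain Python, independent of Spark. The `sales` records and the `group_and_aggregate` helper below are hypothetical, made up only to illustrate the idea:

```python
from collections import defaultdict

# Hypothetical sales records: (day, amount)
sales = [
    ("2021-06-19", 10.0),
    ("2021-06-19", 5.0),
    ("2021-06-20", 7.5),
]

def group_and_aggregate(rows):
    """Group (key, value) pairs by key, then reduce each group to a sum and a count."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)  # the "group by" step
    # The "aggregate" step: many rows per key become one output row
    return {key: {"sum": sum(values), "count": len(values)}
            for key, values in groups.items()}

result = group_and_aggregate(sales)
print(result)
```

Spark does the same thing in a distributed way: rows sharing a key are brought together and each group is reduced to one output row.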

**`Dataframe Aggregation`**

A set of methods for aggregations on a DataFrame:

    agg
    avg
    count
    max
    mean
    min
    pivot
    sum

https://hendra-herviawan.github.io/pyspark-groupby-and-aggregate-functions.html


In [77]:
sdf.columns

['text',
 'favourite_count',
 'retweet_count',
 'created_at',
 'negative_nltk',
 'positive_nltk',
 'neutral_nltk',
 'compound_nltk']

In [108]:
sdf.select('created_at')

DataFrame[created_at: date]


# sdf_agg_byDate =  sdf.groupBy('created_at').agg({'negative_nltk':'sum'}, {'positive_nltk':'sum'}, {'neutral_nltk':'sum'},	{'compound_nltk':'sum'})
import pyspark.sql.functions as f
from pyspark.sql.functions import col

sdf_agg_byDate =  sdf.groupBy('created_at').agg({'negative_nltk':'sum','positive_nltk':'sum','neutral_nltk':'sum','compound_nltk':'sum','created_at':'count'})

sdf_agg_byDate.count()

In [79]:
# sdf.groupBy('created_at').count().select('created_at', f.col('count').alias('tweets')).show()

In [81]:
(sdf
.groupBy('created_at')
.agg(f.count('created_at').alias('tweets'))
.show())

+----------+------+
|created_at|tweets|
+----------+------+
|2021-06-20|  2407|
|2021-06-19|  4103|
+----------+------+



In [82]:
sdf_agg_byDate.show()

+----------+-------------------+------------------+-----------------+------------------+------------------+
|created_at| sum(compound_nltk)|sum(positive_nltk)|count(created_at)|sum(negative_nltk)| sum(neutral_nltk)|
+----------+-------------------+------------------+-----------------+------------------+------------------+
|2021-06-20| -36.28830000000009| 112.9250000000001|             2407|           133.614|2160.4989999999916|
|2021-06-19|-106.83589999999874| 139.2199999999993|             4103|176.49200000000073|3787.2440000000083|
+----------+-------------------+------------------+-----------------+------------------+------------------+



In [None]:
# sdf_agg_byDate = sdf_agg_byDate.withColumn("sum", col("sum(negative_nltk)")+col("science_score"))
# 	df1.show()

In [83]:
# How to delete columns in pyspark dataframe
columns_to_drop = ['sum(compound_nltk)', 'sum(positive_nltk)', 'sum(negative_nltk)', 'sum(neutral_nltk)']

sdf_agg_byDate = (sdf_agg_byDate
    .withColumn('compound_nltk', f.col('sum(compound_nltk)')/f.col('count(created_at)'))
    .withColumn( 'positive_nltk',f.col('sum(positive_nltk)')/f.col('count(created_at)'))
    .withColumn( 'negativen_ltk',f.col('sum(negative_nltk)')/f.col('count(created_at)'))
    .withColumn( 'neutral_nltk',f.col('sum(neutral_nltk)')/f.col('count(created_at)'))
    .withColumnRenamed('count(created_at)', 'tweets')).drop(*columns_to_drop)

sdf_agg_byDate.show()

+----------+------+--------------------+--------------------+--------------------+------------------+
|created_at|tweets|       compound_nltk|       positive_nltk|       negativen_ltk|      neutral_nltk|
+----------+------+--------------------+--------------------+--------------------+------------------+
|2021-06-20|  2407|-0.01507615288741...|0.046915247195679306|0.055510594100540094|0.8975899459908565|
|2021-06-19|  4103|-0.02603848403607...|0.033931269802583305|0.043015354618571956| 0.923042651718257|
+----------+------+--------------------+--------------------+--------------------+------------------+



In [97]:
sdf_agg_byDate.toPandas()
# .drop(*columns_to_drop)

Unnamed: 0,created_at,count(created_at),compound_nltk,positive_nltk,negativen_ltk,neutral_nltk
0,2021-06-20,2407,-0.015076,0.046915,0.055511,0.89759
1,2021-06-19,4103,-0.026038,0.033931,0.043015,0.923043


In [101]:
sdf.created_at[0]

Column<b'created_at[0]'>

In [84]:
# The below statement changes column 'count(created_at)' to 'tweets' on PySpark DataFrame. 
sdf_agg_byDate = (sdf_agg_byDate
    .withColumnRenamed('count(created_at)', 'tweets'))

In [85]:
sdf_agg_byDate.toPandas()

Unnamed: 0,created_at,tweets,compound_nltk,positive_nltk,negativen_ltk,neutral_nltk
0,2021-06-20,2407,-0.015076,0.046915,0.055511,0.89759
1,2021-06-19,4103,-0.026038,0.033931,0.043015,0.923043


In [218]:
# DataFrame.sort(*cols, **kwargs) - Returns a new DataFrame sorted by the specified column(s).
sdf_agg_byDate = sdf_agg_byDate.sort('created_at')

sdf_agg_byDate.toPandas()

Unnamed: 0,created_at,tweets,compound_nltk,positive_nltk,negativen_ltk,neutral_nltk
0,2021-06-19,4103,-0.026038,0.033931,0.043015,0.923043
1,2021-06-20,2407,-0.015076,0.046915,0.055511,0.89759


In [224]:
# Column names and their Spark data types
sdf_agg_byDate.dtypes

[('created_at', 'date'),
 ('tweets', 'bigint'),
 ('compound_nltk', 'double'),
 ('positive_nltk', 'double'),
 ('negativen_ltk', 'double'),
 ('neutral_nltk', 'double')]

In [225]:
sdf_agg_byDate.head(1)

[Row(created_at=datetime.date(2021, 6, 19), tweets=4103, compound_nltk=-0.026038484036070862, positive_nltk=0.033931269802583305, negativen_ltk=0.043015354618571956, neutral_nltk=0.923042651718257)]

In [226]:
sdf_agg_byDate.limit(1).show()

+----------+------+--------------------+--------------------+--------------------+-----------------+
|created_at|tweets|       compound_nltk|       positive_nltk|       negativen_ltk|     neutral_nltk|
+----------+------+--------------------+--------------------+--------------------+-----------------+
|2021-06-19|  4103|-0.02603848403607...|0.033931269802583305|0.043015354618571956|0.923042651718257|
+----------+------+--------------------+--------------------+--------------------+-----------------+



In [223]:
# first() returns the first row as a single Row (compare head() and tail())
sdf_agg_byDate.first()

Row(created_at=datetime.date(2021, 6, 19), tweets=4103, compound_nltk=-0.026038484036070862, positive_nltk=0.033931269802583305, negativen_ltk=0.043015354618571956, neutral_nltk=0.923042651718257)

In [None]:
import datetime

date_from, date_to = ("2021-06-19", "2021-06-20")

# Parse the ISO date strings into datetime.date objects
date_from, date_to = (
    datetime.datetime.strptime(s, "%Y-%m-%d").date()
    for s in (date_from, date_to))

# Keep only the rows dated after date_from
sdf_agg_byDate.filter(col('created_at') > date_from)

In [220]:
# sqlContext.sql("set spark.sql.shuffle.partitions=10")


# Read the current number of shuffle partitions from the session's runtime configuration
spark.conf.get('spark.sql.shuffle.partitions')

---
## Pandas groupby and aggregate functions

In [88]:
# Change name of a specific column
# pdf = pdf.rename(columns={'ncompound_nltk':'compound_nltk'})
# pdf.head(2)

Unnamed: 0,text,favourite_count,retweet_count,created_at,negative_nltk,positive_nltk,neutral_nltk,compound_nltk
0,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20 23:59:37,0.049,0.199,0.752,0.7793
1,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20 23:59:33,0.0,0.397,0.603,0.7639


In [202]:
pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6510 entries, 0 to 6509
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   text             6510 non-null   object 
 1   favourite_count  6510 non-null   int64  
 2   retweet_count    6510 non-null   int64  
 3   created_at       6510 non-null   object 
 4   negative_nltk    6510 non-null   float64
 5   positive_nltk    6510 non-null   float64
 6   neutral_nltk     6510 non-null   float64
 7   compound_nltk    6510 non-null   float64
dtypes: float64(4), int64(2), object(2)
memory usage: 407.0+ KB


In [203]:
pdf['created_at'] = pd.to_datetime(pdf['created_at'])


In [204]:
pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6510 entries, 0 to 6509
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   text             6510 non-null   object        
 1   favourite_count  6510 non-null   int64         
 2   retweet_count    6510 non-null   int64         
 3   created_at       6510 non-null   datetime64[ns]
 4   negative_nltk    6510 non-null   float64       
 5   positive_nltk    6510 non-null   float64       
 6   neutral_nltk     6510 non-null   float64       
 7   compound_nltk    6510 non-null   float64       
dtypes: datetime64[ns](1), float64(4), int64(2), object(1)
memory usage: 407.0+ KB


In [165]:
pdf.head(2)

Unnamed: 0,text,favourite_count,retweet_count,created_at,negative_nltk,positive_nltk,neutral_nltk,compound_nltk
0,@breakfasttv I received the Astrazeneca vaccin...,0,0,2021-06-20,0.049,0.199,0.752,0.7793
1,#Pfizer\n#AstraZeneca\n#Moderna\n#JohnsonAndJo...,3,5,2021-06-20,0.0,0.397,0.603,0.7639


In [205]:
pdf['created_at'] = pdf['created_at'].dt.strftime('%Y-%m-%d')

In [206]:
pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6510 entries, 0 to 6509
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   text             6510 non-null   object 
 1   favourite_count  6510 non-null   int64  
 2   retweet_count    6510 non-null   int64  
 3   created_at       6510 non-null   object 
 4   negative_nltk    6510 non-null   float64
 5   positive_nltk    6510 non-null   float64
 6   neutral_nltk     6510 non-null   float64
 7   compound_nltk    6510 non-null   float64
dtypes: float64(4), int64(2), object(2)
memory usage: 407.0+ KB


In [174]:
# For a DataFrame, by default the aggregates return results within each column:
# pdf['created_at'].dt.date => Converts Datetime to Date in Pandas df
pdf_agg_byDate =  pdf.groupby('created_at').agg({'negative_nltk':'sum','positive_nltk':'sum','neutral_nltk':'sum','compound_nltk':'sum', 'created_at':'size'})

pdf_agg_byDate.count()

negative_nltk    2
positive_nltk    2
neutral_nltk     2
compound_nltk    2
created_at       2
dtype: int64

In [175]:
pdf_agg_byDate

Unnamed: 0_level_0,negative_nltk,positive_nltk,neutral_nltk,compound_nltk,created_at
created_at,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2021-06-19,176.492,139.22,3787.244,-106.8359,4103
2021-06-20,133.614,112.925,2160.499,-36.2883,2407


In [209]:
pdf_agg_byDate =  pdf.groupby('created_at').agg(negative_nltk=('negative_nltk','sum'),positive_nltk=('positive_nltk','sum'),neutral_nltk=('neutral_nltk','sum'),compound_nltk=('compound_nltk','sum'), tweets = ('created_at','size'))

pdf_agg_byDate.reset_index(level=0, inplace=True)
pdf_agg_byDate.count()

created_at       2
negative_nltk    2
positive_nltk    2
neutral_nltk     2
compound_nltk    2
tweets           2
dtype: int64

In [210]:
# row 0
type(pdf.created_at[0])

str

In [211]:
# row 5
type(pdf.created_at[5])

str

In [212]:
pdf_agg_byDate

Unnamed: 0,created_at,negative_nltk,positive_nltk,neutral_nltk,compound_nltk,tweets
0,2021-06-19,176.492,139.22,3787.244,-106.8359,4103
1,2021-06-20,133.614,112.925,2160.499,-36.2883,2407


In [193]:
pdf_agg_byDate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   created_at     2 non-null      object 
 1   negative_nltk  2 non-null      float64
 2   positive_nltk  2 non-null      float64
 3   neutral_nltk   2 non-null      float64
 4   compound_nltk  2 non-null      float64
 5   tweets         2 non-null      int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 224.0+ bytes


In [20]:
pdf_cof_island2['created_at'] = pdf_cof_island2['created_at'].dt.date

AttributeError: Can only use .dt accessor with datetimelike values

In [21]:
pdf_cof_island2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   text             14 non-null     object
 1   favourite_count  14 non-null     int64 
 2   retweet_count    14 non-null     int64 
 3   created_at       14 non-null     object
dtypes: int64(2), object(2)
memory usage: 576.0+ bytes
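
The `AttributeError` above is what pandas raises when the `.dt` accessor is used on a column that still has `object` dtype, as `created_at` does in the `info()` output. A minimal sketch of the usual remedy, converting with `pd.to_datetime` first, on a made-up frame (`toy` and its values are hypothetical):

```python
import datetime
import pandas as pd

toy = pd.DataFrame({"created_at": ["2021-06-20 23:59:37", "2021-06-19 00:00:27"]})

# Strings must be parsed before the .dt accessor is available
toy["created_at"] = pd.to_datetime(toy["created_at"])

# .dt.date drops the time component, leaving datetime.date objects
toy["created_at"] = toy["created_at"].dt.date

print(toy["created_at"].tolist())
```

Note that after `.dt.date` the column is `object` dtype again (it holds Python `date` objects), so a second `.dt` call on it would fail in the same way.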


### Divide multiple columns by another column in pandas

In [216]:

# pdf_agg_byDate = (pdf_agg_byDate.apply(['compound_nltk']/pdf_agg_byDate['tweets']
#     .withColumn( 'positive_nltk',f.col('sum(positive_nltk)')/f.col('count(created_at)'))
#     .withColumn( 'negativen_ltk',f.col('sum(negative_nltk)')/f.col('count(created_at)'))
#     .withColumn( 'neutral_nltk',f.col('sum(neutral_nltk)')/f.col('count(created_at)'))
#     .withColumnRenamed('count(created_at)', 'tweets')).drop(*columns_to_drop

pdf_agg_byDate[['negative_nltk','positive_nltk','neutral_nltk','compound_nltk']]=\
    (pdf_agg_byDate[['negative_nltk','positive_nltk','neutral_nltk','compound_nltk']].divide(pdf_agg_byDate ['tweets'], axis = 'index'))


pdf_agg_byDate


Unnamed: 0,created_at,negative_nltk,positive_nltk,neutral_nltk,compound_nltk,tweets
0,2021-06-19,0.043015,0.033931,0.923043,-0.026038,4103
1,2021-06-20,0.055511,0.046915,0.89759,-0.015076,2407


In [208]:
pdf_agg_byDate

Unnamed: 0,negative_nltk,positive_nltk,neutral_nltk,compound_nltk
0,0.043015,0.033931,0.923043,-0.026038
1,0.055511,0.046915,0.89759,-0.015076
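
Dividing each sum by the tweet count gives the per-day mean, so the same figures could plausibly be obtained by aggregating with `mean` directly. A sketch on a made-up frame (`toy`, its values, and `score_cols` are hypothetical) comparing the two routes:

```python
import pandas as pd

toy = pd.DataFrame({
    "created_at": ["2021-06-19", "2021-06-19", "2021-06-20"],
    "compound_nltk": [0.5, -0.25, 0.75],
    "positive_nltk": [0.25, 0.0, 0.5],
})
score_cols = ["compound_nltk", "positive_nltk"]

# Route 1: sum per day, then divide every score column by the row count
sums = toy.groupby("created_at")[score_cols].sum()
sizes = toy.groupby("created_at").size()
by_division = sums.divide(sizes, axis="index")

# Route 2: let pandas compute the mean per day
by_mean = toy.groupby("created_at")[score_cols].mean()

# Largest absolute difference between the two results
max_diff = (by_division - by_mean).abs().max().max()
print(max_diff)
```

The sum-then-divide route is still useful when the count itself (here, `tweets`) has to be kept as a column alongside the averages.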


# perfplot

In [68]:
import numpy as np
import pandas as pd
import perfplot

perfplot.save(
    "out.png",
    setup=lambda n: pd.DataFrame(np.arange(n * 3).reshape(n, 3)),
    n_range=[2**k for k in range(25)],
    kernels=[
        lambda df: len(df.index),
        lambda df: df.shape[0],
        lambda df: df[df.columns[0]].count(),
    ],
    labels=["len(df.index)", "df.shape[0]", "df[df.columns[0]].count()"],
    xlabel="Number of rows",
)

Output()