# Getting Information from Social Media (Twitter)

<img src="Images/pic1.JPG" alt="Drawing"/>

+ **Web Crawling** is an automated program, system, or script that systematically scans the pages of a website. A web crawler indexes those pages and can extract information from them. The results are typically used to study the content of the site's pages.
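+ **Streaming** is the continuous retrieval of data in near real time as it is published, for example receiving new tweets as they are posted instead of searching for past ones.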
+ **Web Scraping** is the activity of extracting information from web pages. A web scraper usually takes that information from the HTML of the page.
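
To make the idea of taking information from HTML concrete, here is a minimal sketch that uses only Python's standard library. The `LinkExtractor` class and the `SAMPLE_HTML` page are hypothetical and exist only for illustration; a real scraper would first download the page, and the Scweet library used later in this notebook works differently (it drives a browser).

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a href=...> tag is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content, used instead of a real download.
SAMPLE_HTML = """
<html><body>
  <h1>Fruit news</h1>
  <a href="/posts/1">Dates for Ramadan</a>
  <a href="/posts/2">Fruit <b>soup</b> recipe</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
links = parser.links
print(links)
```

The scraper reads the markup, keeps only the elements it cares about (here, links), and returns them as structured data.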

### Example Scrapers/Crawlers

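**Official Scraper/Crawler**   : Tweepy (a client for the official Twitter API), Scrapy (a general-purpose crawling framework)

**Unofficial Scraper/Crawler** : Twitterscraper, Scweet (both scrape the Twitter website directly rather than using the API)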


### Scraping Example

In this session we will use Scweet as an example web scraper. Scweet scrapes pages of the Twitter website.

#### 1. Import the required libraries

The libraries required for this scraping session are Scweet and pandas.

In [1]:
!pip install Scweet==1.0
!pip install pandas==1.1.3



In [2]:
from Scweet.scweet import scrap
from Scweet.user import get_user_information, get_users_following, get_users_followers
import pandas as pd

In [3]:
pd.__version__

'1.1.3'

#### 2. Scrape tweets containing certain words

With Scweet, we can retrieve top tweets that contain specific words by using the `scrap` function provided by the Scweet library.

In [4]:
# keywords
keywords = ['kurma']

# Date interval
initial_date = '2021-04-28'
finish_date = '2021-04-30'

all_datas = []
for x in keywords:
    data = scrap(words=x,
                 start_date=initial_date,
                 max_date=finish_date,
                 from_account=None,
                 interval=1, 
                 headless=True,
                 save_images=False,
                 display_type=None,
                 resume=False,
                 filter_replies=True,
                 proximity=True)
    
    data['keyword'] = x
    all_datas.append(data)

all_datas = pd.concat(all_datas)

Scraping on headless mode.
looking for tweets between 2021-04-28 and 2021-04-29 ...
 path : https://twitter.com/search?q=(kurma)%20until%3A2021-04-29%20since%3A2021-04-28%20%20-filter%3Areplies&src=typed_query&lf=on
scroll  1
scroll  2
looking for tweets between 2021-04-29 and 2021-04-30 ...
 path : https://twitter.com/search?q=(kurma)%20until%3A2021-04-30%20since%3A2021-04-29%20%20-filter%3Areplies&src=typed_query&lf=on
Tweet made at: 2021-04-29T11:21:06.000Z is found.
Tweet made at: 2021-04-29T12:06:34.000Z is found.
scroll  1
scroll  2
scroll  3
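
The log above suggests that the date range is split into windows of `interval` days and that each window becomes one Twitter search URL with `since:` and `until:` operators. The following is a minimal sketch of that idea, reconstructed only from the logged URLs and not taken from Scweet's source code. The helper names `date_windows` and `search_url` are hypothetical, the last window is assumed to be capped at the end date, and the `lf=on` parameter from the log is left out.

```python
from datetime import date, timedelta
from urllib.parse import quote


def date_windows(start, end, interval_days):
    """Split [start, end) into consecutive windows of at most interval_days."""
    windows = []
    cursor = start
    while cursor < end:
        # Assumption: the final window is cut off at the end date.
        upper = min(cursor + timedelta(days=interval_days), end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


def search_url(words, since, until, filter_replies=True):
    """Build a search URL shaped like the ones printed in the log."""
    query = f"({words}) until:{until.isoformat()} since:{since.isoformat()}"
    if filter_replies:
        query += " -filter:replies"
    # Percent-encode spaces and colons; keep the parentheses readable.
    return f"https://twitter.com/search?q={quote(query, safe='()')}&src=typed_query"


windows = date_windows(date(2021, 4, 28), date(2021, 4, 30), interval_days=1)
urls = [search_url("kurma", since, until) for since, until in windows]
for url in urls:
    print(url)
```

With `interval_days=1` this yields two windows, matching the two "looking for tweets between ..." lines in the log. A larger interval means fewer searches, but each search has to scroll through more results.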


In [5]:
all_datas

| | UserScreenName | UserName | Timestamp | Text | Embedded_text | Emojis | Comments | Likes | Retweets | Image link | Tweet URL | keyword |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lovaditya Dhika | @lovaditya | 2021-04-29T11:21:06.000Z | bales chat panda edisi mikir bgt myampe baru s... | | | | | | [] | https://twitter.com/lovaditya/status/138772864... | kurma |
| 1 | 𝐿𝑜𝓋𝑒°𝓂𝓎𝓈𝑒𝓁𝒻 | @Henny_purlina | 2021-04-29T12:06:34.000Z | Maem kurma purun ? | | 😌 💜 | | | | [] | https://twitter.com/Henny_purlina/status/13877... | kurma |
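
The `Timestamp` column holds ISO 8601 strings with a trailing `Z`, which denotes UTC. For analysis it is often convenient to turn them into timezone-aware `datetime` objects. The sketch below uses only the standard library. Converting to UTC+7 (Western Indonesia Time) is an assumption made for illustration, and `parse_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied from the output above; the trailing "Z" marks UTC.
timestamps = ["2021-04-29T11:21:06.000Z", "2021-04-29T12:06:34.000Z"]

# Assumption for illustration: report times in UTC+7 (Western Indonesia Time).
WIB = timezone(timedelta(hours=7), name="WIB")


def parse_timestamp(value):
    """Parse a string like '2021-04-29T11:21:06.000Z' as an aware UTC datetime."""
    naive = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return naive.replace(tzinfo=timezone.utc)


parsed = [parse_timestamp(ts) for ts in timestamps]
local = [dt.astimezone(WIB) for dt in parsed]
for dt in local:
    print(dt.isoformat())
```

Keeping the values timezone-aware avoids silent off-by-hours errors when tweets are later grouped by day or hour.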


In [6]:
all_datas[all_datas['keyword'] == 'es buah']

| | UserScreenName | UserName | Timestamp | Text | Embedded_text | Emojis | Comments | Likes | Retweets | Image link | Tweet URL | keyword |
|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [7]:
# Save the data to a CSV file without writing the DataFrame index
filename = 'all_keywordsv2.csv'
all_datas.to_csv(filename, index=False)

In [8]:
hashtag = 'sopbuah'

initial_date = '2021-04-28'
finish_date = '2021-04-30'

data = scrap(hashtag=hashtag,
             start_date=initial_date,
             max_date=finish_date,
             from_account=None,
             interval=5,
             headless=True,
             display_type="Top",
             save_images=False, 
             resume=False,
             filter_replies=False,
             proximity=True)

Scraping on headless mode.


In [9]:
data.head()

| | UserScreenName | UserName | Timestamp | Text | Embedded_text | Emojis | Comments | Likes | Retweets | Image link | Tweet URL |
|---|---|---|---|---|---|---|---|---|---|---|---|


### Get the main information of a given list of users

In [10]:
users = ['@raisa6690', '@isyanasarasvati']

# This function returns, for each user, a list containing:
# ["nb of following", "nb of followers", "join date", "birthdate", "location", "website", "description"]

users_info = get_user_information(users, headless=True)

Scraping on headless mode.
--------------- @raisa6690 information : ---------------
Following :  439
Followers :  9M
Location :  
Join date :  Joined August 2009
Birth date :  
Description :  I'm a singer, in love forever with it. In it for the love of music, not for the glitter and gold :) Contact : Boim +628568526196 / adryboim@junirecords.com
Website :  https://t.co/5H6zfx8M19?amp=1
--------------- @isyanasarasvati information : ---------------
Following :  98
Followers :  268.5K
Location :  
Join date :  Joined October 2011
Birth date :  
Description :  Musician | 
@redrose_records
 | CP : +6281314155565 (Sarah)
Website :  https://t.co/yU2UNpVQjX?amp=1


In [11]:
users_df = pd.DataFrame(users_info, index = ["nb of following",
                                             "nb of followers",
                                             "join date", 
                                             "birthdate",
                                             "location",
                                             "website",
                                             "description"]).T
users_df

| | nb of following | nb of followers | join date | birthdate | location | website | description |
|---|---|---|---|---|---|---|---|
| @raisa6690 | 439 | 9M | Joined August 2009 | | | https://t.co/5H6zfx8M19?amp=1 | I'm a singer, in love forever with it. In it f... |
| @isyanasarasvati | 98 | 268.5K | Joined October 2011 | | | https://t.co/yU2UNpVQjX?amp=1 | Musician \| \n@redrose_records\n \| CP : +628131... |
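
The follower and following counts come back as display strings such as `9M` and `268.5K`, so they cannot be sorted or compared numerically as they are. Below is a minimal sketch of one way to convert them. The `parse_count` helper and the suffix table are assumptions made for illustration and are not part of Scweet. The results are approximations, because the displayed values are already rounded.

```python
from decimal import Decimal

# Suffix multipliers for the abbreviated counts shown above (assumed set).
MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_count(text):
    """Convert a display string such as '268.5K' or '9M' to an approximate int."""
    cleaned = text.strip().replace(",", "").upper()
    if not cleaned:
        return None  # nothing was scraped for this field
    if cleaned[-1] in MULTIPLIERS:
        # Decimal avoids float rounding on values like '268.5'.
        return int(Decimal(cleaned[:-1]) * MULTIPLIERS[cleaned[-1]])
    return int(cleaned)


counts = {
    handle: parse_count(raw)
    for handle, raw in [("@raisa6690", "9M"), ("@isyanasarasvati", "268.5K")]
}
print(counts)
```

With pandas, the same helper could be applied to a column, for example `users_df["nb of followers"].map(parse_count)`.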
