<a href="https://colab.research.google.com/github/Paul-mwaura/Natural-Language-Processing/blob/main/Data_Scraping_Job_Tweets_from_Twitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Business Understanding

## Objective
Scrape job postings from Twitter.

## Identifying job postings
1. Popular hashtags in Kenya include #IkoKazi, #IkoKaziKE, #jobsinkenya, #PataKaziKe, #Kazi, #JobSeekersKE, #KenyanJobs and #kenyanjobsdaily.

2. Other hashtags noted were #Gethired, #jobs, #jobsearch, #jobposting, #recruitment, #hiring, #jobsites and #jobshiring. These would probably return job postings from around the world. Restricting the search to geotagged tweets (e.g. tweets from Kenya carrying the hashtag #jobs) might ensure that only Kenyan tweets are returned.

3. Other keywords are not hashtagged, e.g. "Looking For", "Apply For".

4. Some Twitter users also give an email address, e.g. vacancies@jantakenya.com, so collecting the email addresses of HR companies would increase the size of the database.

5. Accounts noted to post jobs with the #ikokazi hashtag include @kenyanjobsblog, @CareerPointKe, @kazikenya, @myjobsin_Kenya and @JantaKenya.
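
As a minimal sketch of option 1, the hashtags can be kept in a list and joined into a single `OR` search string. The helper name `build_hashtag_query` is hypothetical and not part of any library; it only illustrates how such a query string could be assembled.

```python
def build_hashtag_query(hashtags):
    """Join hashtags into one 'A OR B OR C' search string.

    Hypothetical helper for illustration: adds a missing '#',
    strips whitespace and drops case-insensitive duplicates.
    """
    seen = set()
    terms = []
    for tag in hashtags:
        tag = tag.strip()
        if not tag:
            continue
        if not tag.startswith("#"):
            tag = "#" + tag
        key = tag.lower()
        if key not in seen:  # keep the first spelling, skip repeats
            seen.add(key)
            terms.append(tag)
    return " OR ".join(terms)


kenyan_hashtags = ["#IkoKazi", "#IkoKaziKE", "jobsinkenya", "#PataKaziKe", "#ikokazi"]
query = build_hashtag_query(kenyan_hashtags)
print(query)
```

Keeping the hashtags in a list makes it easy to add or remove terms without editing a long string by hand.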


## Library
Tweepy and twint were considered. Tweepy relies on the official Twitter API, whose standard tier limits search to roughly the last 7 days and user timelines to the most recent 3,200 tweets. Twint does not have these limits, hence it is the library of choice. Read more about twint at https://github.com/twintproject/twint

Another library, twitterscraper, was noted, but its use has not been explored yet.
https://github.com/taspinar/twitterscraper



## Data Extraction Using Twint

Data extraction adopts only option 1 above, where popular hashtags are queried.

#### Import Libraries

In [None]:
# Download and install the libraries
!pip3 install twint -q  # Twitter scraping package
!pip install nest_asyncio  # Handles the RuntimeError "This event loop is already running", first encountered when querying "loser" tweets from realdonaldtrump. With it applied, more job tweets were extracted.

  Building wheel for twint (setup.py) ... done
  Building wheel for fake-useragent (setup.py) ... done
  Building wheel for googletransx (setup.py) ... done
  Building wheel for idna-ssl (setup.py) ... done
  Building wheel for typing (setup.py) ... done
Collecting nest_asyncio
  Downloading https://files.pythonhosted.org/packages/a0/44/f2983c5be9803b08f89380229997e92c4bdd7a4a510ccee565b599d1bdc8/nest_asyncio-1.4.0-py3-none-any.whl
Installing collected packages: nest-asyncio
Successfully installed nest-asyncio-1.4.0


In [None]:
import twint
import pandas as pd
import nest_asyncio
nest_asyncio.apply() 

#### Approach 1: Save to CSV directly

In both approach 1 and approach 2, twint extracted tweets for roughly the last month only, with more than 6,000 tweets exported. Further research is needed to extract tweets over a longer window.

In [None]:
# Search requirements
c = twint.Config()
c.Limit = 20000
c.Search = '#IkoKazi OR #IkoKaziKe OR #jobsinkenya OR #Kazi OR #PataKaziKe OR #JobSeekersKE OR #KenyanJobs OR #kenyanjobsdaily'
c.Since = '2020-05-01'  # twint's option is `Since`; a lowercase `since` attribute is ignored
c.Store_csv = True
c.Output = "job_tweets.csv"
# Execute search
twint.run.Search(c)


Streaming output truncated to the last 5000 lines.
1285482454955044876 2020-07-21 07:51:15 UTC <Simon_Ingari> Job vacancies in Nairobi WhatsApp Group -  https://opportunitiesforyoungkenyans.co.ke/2020/07/21/job-vacancies-nairobi-whatsapp-group-6/ … #IkoKaziKe #PataKaziKe #Gethired #JobSeekers ... pic.twitter.com/wrujViSQI1
1285482392862494720 2020-07-21 07:51:00 UTC <Simon_Ingari> Job Alert Industry – WhatsApp Group -  https://opportunitiesforyoungkenyans.co.ke/2020/07/21/job-alert-industry-whatsapp-group/ … #IkoKaziKe #PataKaziKe #Gethired #JobSeekers #Job... pic.twitter.com/YKxLveM6GB
1285482346867761159 2020-07-21 07:50:49 UTC <Simon_Ingari> Teachers Job Noticeboard WhatsApp Group -  https://opportunitiesforyoungkenyans.co.ke/2020/07/21/teachers-job-noticeboard-whatsapp-group/ … #IkoKaziKe #PataKaziKe #Gethired #JobSeekers ... pic.twitter.com/TbK3JvU1vG
1285482231822258176 2020-07-21 07:50:21 UTC <Simon_Ingari> Job vacancies in central Kenya – WhatsApp Group -  https:/
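
Since twint was observed to return only about the last month of tweets, one possible way to widen the window is to run one search per date range. The sketch below only generates the date ranges; the helper name `date_windows` is hypothetical, and applying each pair through twint's `Since` and `Until` options is an assumption that has not been tested in this notebook.

```python
from datetime import date, timedelta


def date_windows(start, end, days=30):
    """Yield (since, until) ISO date strings covering [start, end) in chunks."""
    current = start
    step = timedelta(days=days)
    while current < end:
        nxt = min(current + step, end)
        yield current.isoformat(), nxt.isoformat()
        current = nxt


windows = list(date_windows(date(2020, 5, 1), date(2020, 7, 30)))
for since, until in windows:
    # Assumption: with twint, one would create a fresh twint.Config(),
    # set c.Since = since and c.Until = until, then call twint.run.Search(c).
    print(since, until)
```

Results from the separate runs would still need to be combined and de-duplicated on the tweet `id`.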

#### Approach 2: Save to pandas dataframe

In [None]:
# Search requirements
c = twint.Config()
c.Limit = 20000
c.Search = '#IkoKazi OR #IkoKaziKe OR #jobsinkenya OR #Kazi OR #PataKaziKe OR #JobSeekersKE OR #KenyanJobs OR #kenyanjobsdaily'
# c.Since = '2020-06-01'  # optional start date; note the capitalised attribute name
c.Pandas = True

#Execute search
twint.run.Search(c)

#Map to dataframe for ease of cleaning
jobtweets_df = twint.storage.panda.Tweets_df
len(jobtweets_df)

In [None]:
jobtweets_df.sample(5)

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,hashtags,cashtags,user_id,user_id_str,username,name,day,hour,link,retweet,nlikes,nreplies,nretweets,quote_url,search,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
586,1286421473658376193,1286421473658376193,1595541754000,2020-07-23 22:02:34,UTC,,Looking for more job opportunities? Check out ...,"[#ikokazike, #ikokazi, #hiring, #jobopening]",[],958628343192211456,958628343192211456,KaziQuest,KaziQuest,4,22,https://twitter.com/KaziQuest/status/128642147...,False,0,0,1,,#IkoKazi OR #IkoKaziKe OR #jobsinkenya OR #Kaz...,,,,,,,"[{'user_id': '958628343192211456', 'username':...",,,,
1476,1285222858042507264,1285222858042507264,1595255982000,2020-07-20 14:39:42,UTC,,Taking #sportsmanagement to the next level.#Lo...,"[#sportsmanagement, #logodesign, #knuckleball,...",[],1214891354628722688,1214891354628722688,SnaapMediaKe,Snaap Media Kenya,1,14,https://twitter.com/SnaapMediaKe/status/128522...,False,1,0,1,,#IkoKazi OR #IkoKaziKe OR #jobsinkenya OR #Kaz...,,,,,,,"[{'user_id': '1214891354628722688', 'username'...",,,,
4826,1280038331045339136,1280038331045339136,1594019894000,2020-07-06 07:18:14,UTC,,A perfect smile guaranteed.\n https://buff.ly/...,"[#utmostdentalcare, #ikokazike]",[],732915352577937408,732915352577937408,thenbodentist,The Holistic Nairobi Dentist,1,7,https://twitter.com/thenbodentist/status/12800...,False,1,0,0,,#IkoKazi OR #IkoKaziKe OR #jobsinkenya OR #Kaz...,,,,,,,"[{'user_id': '732915352577937408', 'username':...",,,,
5733,1277926083954741248,1277926083954741248,1593516295000,2020-06-30 11:24:55,UTC,,Kazi - 夕陽（short ver.） - at なんば湊町リバープレイス https...,"[#kazi, #カジ, #夕陽, #loveyouthfulpark, #路上ライブ, #...",[],138098669,138098669,matsudashozo,松田ショウゾウ/matsudashozo,2,11,https://twitter.com/matsudashozo/status/127792...,False,1,0,0,,#IkoKazi OR #IkoKaziKe OR #jobsinkenya OR #Kaz...,,,,,,,"[{'user_id': '138098669', 'username': 'matsuda...",,,,
1517,1285189095724113925,1285189095724113925,1595247932000,2020-07-20 12:25:32,UTC,,Job Hunting can be emotionally draining . \n\n...,"[#ikokazike, #sakajaaplologises]",[],362373842,362373842,Simon_Ingari,Simon Ingari,1,12,https://twitter.com/Simon_Ingari/status/128518...,False,0,0,1,,#IkoKazi OR #IkoKaziKe OR #jobsinkenya OR #Kaz...,,,,,,,"[{'user_id': '362373842', 'username': 'Simon_I...",,,,


In [None]:
#Filter specific columns
jobs = jobtweets_df[['id','date','tweet','hashtags','username','name','day','hour','retweet','nlikes','nretweets','reply_to']]
#Drop duplicates and export to CSV
jobs = jobs.drop_duplicates(subset='id', keep='first')
jobs.to_csv('jobs.csv', index=False)

#### Observations

1. Tweets are mainly in English, with some in Kiswahili.

2. Tweets about job adverts. Examples: 

New post: Career Opportunity at Safal Group  https://www.careerpoint-solutions.com/career-opportunity-at-safal-group/ … #IKoKazi #IkoKaziKe #design-manager-jobs

Looking for more job opportunities? Check out the Customer Care (Support Team) Professional Job
Click here to apply  http://app.kaziquest.com/jobs/customer-care-support-team-professional/ … #IkoKaziKE #IkoKazi #Hiring #JobOpening pic.twitter.com/XgV9E55PpC


3. Product/service advertisement tweets, e.g.

"Hey Twitter fam,kindly follow us on IG and Fb @uzima_foods_groceries.We deliver fresh fruits right to your doorstep.We also have gift fruit baskets and can surprise your loved ones on your behalf. 📲Dm/call/Whatsapp us on +254723602737 for more enquires and orders.#IkoKaziKE  pic.twitter.com/aHXAA06MrG"

Have you booked dental appointment for this week? Call us today for a dental  consultation.⠀ Please feel #Free to contact us 254725526047/+254732690149⠀NB:(Price will be quoted separately after consultation)⠀#UtmostDentalCare #oralhealth #dentalhygienist #IkoKaziKE  pic.twitter.com/mSLG9uCwBM

Van's on Sale 1800 Size 39-40 we deliver also. All available. #MainaAndKingangi #IkoKaziKE #MTVHottest #Kenya #maishacountdown #Alexa pic.twitter.com/7LsNPtt7rs

HP ProBook 430 g3 * Intel Core i5 (6th Gen) * 4GB RAM *500 gb harddisk * 13.3" kshs 28,500 ☎0707311340 #KameneAndJalas #MwashumbeNaShugaboy #BarakaZaMilele #AlexnaTrickyMilele #1Man1Vote1Shilling #kiambu turkana Mombasa railways murkomen #gidinaghostasubuhi #IkoKaziKE  pic.twitter.com/gxtGkSK3gE

MondayMorning  Hi! We do Graphic Design | Web Design | App Development | Logo Design  and  Printing.  Get in  touch via DM or WhatsApp  https://wa.me/c/254**** . #IkoKaziKE #1man1vote #ikokazi  pic.twitter.com/7bGlHh1hVZ

4. URL link to a job advert, where the job is not mentioned explicitly, e.g.

"New post: Latest Microsoft Careers  https://www.careerpoint-solutions.com/latest-microsoft-careers/ … #IKoKazi #IkoKaziKe #microsoft-jobs"

5. People sharing their profiles for potential employers

It's MONDAY folks. I had major stakes in the events/entertainment world. But since COVID happened it's been a hustle. 
Qualifications:- Bachelor of Commerce. Exp:- 8 years plus Banking & Finance | Credit Management | Customer Experience. RT my boss might be here. #IkoKaziKe

Wekeni rts apo Anyone looking for a driver ndio huyu apa. #IkoKaziKe #IkoKaziKe  https://twitter.com/McOgutu_Taller/status/1286201149557800960 …

.#ikokazike Am a professional voice over artist with over 5 yes experience.I have done several projects ranging from TV and radio Ads,explainers,animations,feature stories , CBC content.Kindly contact me for your next project.

6. Sharing social information

Apply For Nyali Driving School Bursary Fund -  https://opportunitiesforyoungkenyans.co.ke/2020/07/27/apply-nyali-driving-school-bursary-fund/ … #IkoKaziKe #PataKaziKe #Gethired #JobSeek... pic.twitter.com/ZCAEiqPzTN
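
The observations above show that the hashtags also pull in product adverts and other noise. As a rough sketch, a keyword heuristic could give each tweet a first-pass label before any manual review. The keyword lists, the email pattern and the function name `label_tweet` are all hypothetical choices drawn from the example tweets, not a tested classifier.

```python
import re

# Hypothetical keyword lists picked from the example tweets above.
JOB_TERMS = ("job", "vacancy", "vacancies", "career", "hiring", "apply")
AD_TERMS = ("we deliver", "on sale", "kshs", "call us", "whatsapp us")
# Simple pattern for addresses such as vacancies@jantakenya.com.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def label_tweet(text):
    """Return 'advert', 'job' or 'other' using crude keyword rules."""
    lowered = text.lower()
    # Advert terms are checked first, so a job tweet that says
    # "call us" would be mislabelled; this is a known limitation.
    if any(term in lowered for term in AD_TERMS):
        return "advert"
    if EMAIL_RE.search(text) or any(term in lowered for term in JOB_TERMS):
        return "job"
    return "other"


samples = [
    "New post: Career Opportunity at Safal Group #IkoKaziKe",
    "Van's on Sale 1800 Size 39-40 we deliver also. #IkoKaziKE",
    "A perfect smile guaranteed. #IkoKaziKE",
]
for sample in samples:
    print(label_tweet(sample), "|", sample)
```

Profile-sharing tweets (observation 5) are not handled by these rules and would need their own keywords or a trained model.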

# Data Cleaning

Import the dataset that combines the tweets collected by all contributors.

In [None]:
job_tweets = pd.read_csv('/content/twitter-job-hunter-chatbot.csv',engine='python') 
job_tweets.sample(5)

Unnamed: 0,Datetime,Text,Source,harsh tag,Favourite Count,Retweets,6,7,submitter_name
29596,28/06/2020 11:26,Incase you missed this sad news- @FredMatiangi...,Wangechi Gitahi Travels,2.0,2.0,,,,Kennedy Njoroge
8349,2020-07-17 21:06:22+00:00,Amazon Corporate LLC is #hiring a Senior Finan...,joblify_app,,0.0,0.0,0.0,0.0,Eric Nzivo
12073,2020-07-02 07:32:52+00:00,Come and join us! We have a #vacancy for a 3-y...,Ettema_lab,,189.0,154.0,0.0,0.0,Eric Nzivo
15185,2020-06-15 09:54:19+00:00,Library Apprentice - #Advanced #Apprenticeshi...,SkillUpSomerset,,2.0,5.0,0.0,0.0,Eric Nzivo
13932,2020-06-22 17:49:05+00:00,#vacancy,davygbahou,,0.0,0.0,0.0,0.0,Eric Nzivo
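
The sample above shows two `Datetime` styles (`28/06/2020 11:26` and `2020-07-17 21:06:22+00:00`). A minimal sketch of normalising both with the standard library follows. The helper name `parse_mixed_datetime` is hypothetical, and it assumes that slash-separated values are day-first and that values without an offset are in UTC; neither assumption has been confirmed against the full dataset.

```python
from datetime import datetime, timezone


def parse_mixed_datetime(value):
    """Parse 'DD/MM/YYYY HH:MM' or ISO 'YYYY-MM-DD HH:MM:SS+00:00' strings."""
    value = value.strip()
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        # Assumption: slash-separated values are day-first.
        parsed = datetime.strptime(value, "%d/%m/%Y %H:%M")
    if parsed.tzinfo is None:
        # Assumption: values without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


for raw in ["28/06/2020 11:26", "2020-07-17 21:06:22+00:00"]:
    print(raw, "->", parse_mixed_datetime(raw).isoformat())
```

If the column has no missing values, the function could be applied with `job_tweets['Datetime'].map(parse_mixed_datetime)`; rows with missing values would need to be handled first.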


In [None]:
test = ['Text']

In [None]:
test

['Text']