Shami Dialect Corpus (SDC)
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
Data
Read Arabic File/readarabicfile
Twitter-API
Twitter-hashtags
twiiter-geolocation
twitter-file
LICENSE
README.md
requirements.txt
tweets python.txt

README.md

This code is written by : Motaz_sassd

tweets-collector

Collect tweets (tweets corpus) using Twitter API.

Collection can be based on hashtags or keywords.

install requirements

pip install -r requirements

Getting your API keys from Twitter

  1. Go to https://apps.twitter.com and create an new app twitter apps

  2. Provide a name and describe for the app, then specify permissions app info

  3. Then go to keys and access management tab app keys

  4. put these info in credentials.txt and in api_keys.py files.

query_tweets.py Usage

usage: query_tweets.py [-h] -k KEYWORDS_FILE -o OUTFILE -n NUMBER

collect tweets based on keywords

optional arguments:
  -h, --help            show this help message and exit
  -k KEYWORDS_FILE, --keywords-file KEYWORDS_FILE
                        keywords or hashtags file. The file should contain one
                        keyword/hashtag per line
  -o OUTFILE, --outfile OUTFILE
                        the output json file path and prefix.
  -n NUMBER, --number NUMBER
                        the number of tweets that you want to collect


json2text.py Usage

usage: json2text.py [-h] -i JSON_DIR -o OUT_DIR

extract tweet texts from json

optional arguments:
  -h, --help            show this help message and exit
  -i JSON_DIR, --json-dir JSON_DIR
                        tweets json directory
  -o OUT_DIR, --out-dir OUT_DIR
                        the output directory.

stream_geolocation.py Usage

Get Geo locations from http://boundingbox.klokantech.com/

usage: stream_geolocation.py [-h] -l GEO_LOCATIONS -j JSON -n NUMBER

collect tweets based on geographic location

optional arguments:
  -h, --help            show this help message and exit
  -l GEO_LOCATIONS, --geo-locations GEO_LOCATIONS
                        geo location coordinates from
                        http://boundingbox.klokantech.com copy and past using 
                        csv option
  -j JSON, --json JSON  the the json output file.
  -n NUMBER, --number NUMBER
                        the number of tweets that you want to collect


stream_users.py Usage

Get users id from https://tweeterid.com

usage: stream_users.py [-h] -u USERS -j JSON -n NUMBER

collect tweets based on following twitter users

optional arguments:
  -h, --help            show this help message and exit
  -u USERS, --users USERS
                        twitter user ids file. Get ids from tweeterid.com
  -j JSON, --json JSON  the the json output file.
  -n NUMBER, --number NUMBER
                        the number of tweets that you want to collect

Some pre processing in Java code

use the java file inside Rad Arabic File corpus for Reading the txt file process the Repeated characters Normalize the .txt file