# Instagram scraper + MongoDB storage
The aim of this notebook is to implement an Instagram scraper that collects posts (images, text and comments) matching a user-specified search query, and to store the downloaded data in MongoDB using the pymongo Python package.

## Scraping data from Instagram
I used the instagram-scraper package to download data from multiple user profiles (lemondefr, lefigarofr, franceinfo, etc.). The profiles to scrape are listed in the 'ig_users.txt' file.

To start the download, run the following command in a shell:<br>
instagram-scraper --filename ig_users.txt --comments --media-types image --maximum 10 -u USERNAME -p PASSWORD
 
For finer control, you can use the 'collecting_data.py' Python file instead: set its data_dir variable to the directory where the downloaded data will be stored, define the variables 'IG_USERNAME' and 'IG_PASSWORD' in a .env file, then run collecting_data.py.
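
A minimal sketch of what such a script could look like is given below. The actual contents of collecting_data.py are not reproduced in this notebook, so the helper names `build_command` and `run_scraper` are hypothetical. The flags are the ones from the shell command above, plus `--destination`, which instagram-scraper is assumed to accept for choosing the output directory. Loading the .env file itself is left out and assumed to have happened already.

```python
import os
import subprocess


def build_command(username, password, data_dir, users_file="ig_users.txt", maximum=10):
    """Build the instagram-scraper command line as a list of arguments."""
    return [
        "instagram-scraper",
        "--filename", users_file,
        "--comments",
        "--media-types", "image",
        "--maximum", str(maximum),
        "--destination", data_dir,  # assumption: output directory flag
        "-u", username,
        "-p", password,
    ]


def run_scraper(data_dir="./data"):
    """Read the credentials from the environment and launch the scraper."""
    # assumption: IG_USERNAME / IG_PASSWORD were already loaded from the .env file
    command = build_command(os.environ["IG_USERNAME"], os.environ["IG_PASSWORD"], data_dir)
    subprocess.run(command, check=True)
```

Passing the arguments as a list keeps the password out of any shell string interpolation; calling `run_scraper("./data")` would then fill the directory read in the next section.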

## Processing the downloaded data
The aim of this part is to filter the downloaded data and keep only the posts that match the user's search query.

I followed these steps:
<ul>
  <li>Keep only the key_words of the search query by removing punctuation and stopwords.</li>
  <li>Keep only the posts whose text contains at least one key_word.</li>
</ul>

In [13]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#research_request = "le décès du président Jacques Chirac."
#research_request = "covid France"
research_request = "médaille JO"

# keeping the key_words: drop French stopwords, then punctuation tokens
french_stopwords = set(stopwords.words("french"))
tokens = word_tokenize(research_request)
tokens = [word for word in tokens if word not in french_stopwords]
research_key_words = [word for word in tokens if word.isalpha()]
print(research_key_words)

['médaille', 'JO']


In [14]:
import glob
import json
import re

# concatenating all downloaded posts
data = {'GraphImages': []}
for json_file in glob.glob(r"./data/*.json"):
    with open(json_file, "r", encoding='utf-8') as f: 
        local_data = json.load(f)
        local_posts = local_data['GraphImages']
        data['GraphImages'].extend(local_posts)

print(f"total number of downloaded posts is: {len(data['GraphImages'])}")

total number of downloaded posts is: 50


In [15]:
wanted_posts_index = []

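# filtering the posts using the key_words
posts = data['GraphImages']
for i, post in enumerate(posts):
    # some posts have no caption, so the edges list can be empty
    caption_edges = post['edge_media_to_caption']['edges']
    text = caption_edges[0]['node']['text'] if caption_edges else ""
    # re.escape makes each key_word match literally (case-sensitive)
    if any(re.search(re.escape(key_word), text) for key_word in research_key_words):
        wanted_posts_index.append(i)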
            
data['GraphImages'] = [posts[ind] for ind in wanted_posts_index]
print(f"Number of filtered posts is: {len(data['GraphImages'])}")

Number of filtered posts is: 4


### Possible improvements:
<ul>
  <li>We can add a counter and only keep posts whose text contains at least n key_words</li>
  <li>We can also look for those key_words in the user comments</li>
</ul>
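
Both ideas can be sketched in a single helper. This is only an illustration: `post_matches` and `min_hits` are hypothetical names, and the code assumes that comments, when present, are stored under `post['comments']['data']` with a `text` field, which should be checked against the downloaded JSON.

```python
def post_matches(post, key_words, min_hits=1, include_comments=False):
    """Return True if at least min_hits distinct key_words occur in the post."""
    # caption text, or nothing when the post has no caption
    edges = post.get('edge_media_to_caption', {}).get('edges', [])
    texts = [edges[0]['node']['text']] if edges else []

    if include_comments:
        # assumption: comments live under post['comments']['data']
        comments = post.get('comments', {}).get('data', [])
        texts.extend(comment.get('text', '') for comment in comments)

    full_text = "\n".join(texts)
    # count how many distinct key_words appear (case-sensitive substring match)
    hits = sum(1 for key_word in key_words if key_word in full_text)
    return hits >= min_hits
```

With such a helper, the filtering step could become `[p for p in posts if post_matches(p, research_key_words, min_hits=2, include_comments=True)]`.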

## Storing Data in MongoDB
The aim of this part is to store the selected posts in a local MongoDB database. MongoDB is a NoSQL database that stores data as documents inside collections.

I will store the post data (in its semi-structured JSON format) in a 'posts' collection, and the images in an 'images' collection.

In a real business project, the choice of the database model relies heavily on how the data will be used (number and frequency of queries such as inserts, updates, etc.).

In [16]:
from pymongo import MongoClient

client = MongoClient(host='localhost', port=27017)

# selecting the database; MongoDB creates it on the first write if it doesn't exist
db = client['scraping_db']

# selecting the posts collection (also created on the first insert)
post_col = db.posts

# inserting the chosen posts into the posts collection
res = post_col.insert_many(data['GraphImages'])
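
Re-running the whole notebook inserts the same posts again, since each freshly loaded document receives a new `_id`. One possible way around this, sketched below, is to reuse the Instagram post identifier as `_id`. The helper `with_stable_id` is hypothetical, and the sketch assumes that every scraped post carries an `id` field.

```python
def with_stable_id(post):
    """Return a copy of the post whose MongoDB _id is the Instagram post id."""
    # assumption: every scraped post has an 'id' field
    document = dict(post)
    document['_id'] = post['id']
    return document
```

Each document could then be written with `post_col.replace_one({'_id': doc['_id']}, doc, upsert=True)`, which overwrites an existing post instead of duplicating it.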

In [17]:
# inserting the images into the images collection
import gridfs

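# Create a GridFS object for the above database, using 'images' as the
# collection prefix (files are stored in images.files and images.chunks).
gf = gridfs.GridFS(db, collection='images')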

for image in glob.glob(r"./data/*.jpg"):
    # opening the image in read-only binary format.
    with open(image, 'rb') as f:
        contents = f.read()

    # storing the image via the GridFS object.
    gf.put(contents, filename=image)