Instadata - A holistic Python Instagram Scraper

Integrates with MongoDB, sitarealemail.com, nltk, a geolocator (Nominatim), instagram_private_api, and instagrapi.

See a demo of the data HERE

What this Scraper can do:

  • Collect keywords, hashtags, e-mail addresses, and phone numbers

  • Analyze gender, name, and language

  • Scrape Linktree profiles to capture additional links, identify custom websites and social profiles, and check whether the given links actually work

  • Check whether a profile is a bot

  • Collect user information such as the biography, profile picture, full name, and more, both in its original form and normalized (stylized fonts are converted to plain letters; see the small normalization sketch after this list)

  • Use proxies for all requests

  • Split the load across as many accounts as you provide
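
The font normalization mentioned in the list above can be illustrated with Python's standard unicodedata module. This is only a minimal sketch of the idea (NFKC compatibility folding), not necessarily the normalization the scraper itself performs:

```python
import unicodedata

def normalize_fancy_text(text: str) -> str:
    # NFKC compatibility normalization folds decorative Unicode letter variants
    # (e.g. "mathematical" script or double-struck letters) back to plain letters.
    return unicodedata.normalize("NFKC", text)

print(normalize_fancy_text("𝓗𝓮𝓵𝓵𝓸 𝕨𝕠𝕣𝕝𝕕"))  # -> "Hello world"
```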

Warning: This scraper is fully automatic; however, if you overdo the scraping, rate limits can occur and your account may get blocked (you will need to manually verify yourself through email or a phone number).

Info: This is the basis of the IDWW package. IDWW (InstaData Web Wrapper) offers an upgraded version of this package that uses Django for a user interface.

How this scraper works

  • Execute the program with python3 main.py
  • When properly configured, the scraper will first connect to all given accounts through two of the best Instagram API wrappers. It will also use proxies, if any are specified and working. When either an account or a proxy connection does not work, the scraper will continue, but inform you through a warning.
  • After that, other initialization work, such as finding starting users, will be completed.
  • The program will now start scraping: it will scrape, analyze, and save all data directly to the database. Afterwards it will sleep for SLEEP_TIME seconds. This means you can watch profiles being scraped in real time via log.csv or through the MongoDB Compass application.
  • With a chance of 1:1750, the program will enter a long sleeping phase to decrease the odds of being rate-limited or blocked by Instagram's anti-scraping measures. It will sleep for a random number of seconds within the range specified in LONG_SLEEP_TIME (see the loop sketch after this list).
  • It will also reconnect to the MongoDB database.
  • At midnight (if specified in ANALYZE_PREVENTION[0]), the program will reconnect the accounts, sleep for ANALYZE_PREVENTION[1] seconds, and reconnect the accounts again.
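
A rough, self-contained sketch of the loop described above (the LONG_SLEEP_TIME values and the placeholder scrape_next_user are illustrative assumptions, not the repository's actual code):

```python
import random
import time

SLEEP_TIME = 5                # seconds to sleep after each scraped profile
LONG_SLEEP_TIME = (300, 900)  # hypothetical min/max seconds for the long sleeping phase

def scrape_next_user():
    # Placeholder for: pick a user, scrape it, analyze it, and write the result to MongoDB
    print("scraped one profile")

def main_loop():
    while True:
        scrape_next_user()
        time.sleep(SLEEP_TIME)
        # With a chance of 1:1750, enter a long sleeping phase
        # (the real scraper also reconnects to MongoDB afterwards)
        if random.randint(1, 1750) == 1:
            time.sleep(random.uniform(*LONG_SLEEP_TIME))

if __name__ == "__main__":
    main_loop()
```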

How to start

Info

  • Variables written in CAPSLOCK (all uppercase) can be changed to customize the program

Must Do before starting

  • Install all packages from requirements.txt
  • In requirements.txt you will find two commented lines: install the instagram_private_api package through the GitHub link. When trying to run the scraper, you will likely see error messages from nltk; use the given Python commands to install the required nltk sub-packages. This is required for text analysis.
  • Download MongoDB on your PC and configure a MongoDB connection string. This is the integrated and strongly preferred database. Recommended: install MongoDB Compass.
  • Create a database and a collection in it (using MongoDB Compass is advised)
  • Configure the database variables, such as the database name, in modules/mongomanager.py
  • Visit sitarealemail.com and get an API key. Paste this API key into the variable API_KEY in modules/mailhandler.py
  • Populate ACCOUNTS_DATA with valid Instagram accounts that will be used to access Instagram. Example: ("username123", "password321")
  • We highly encourage you to use proxies to increase privacy. Paste working proxies into resources/proxies.csv (a list of public proxies can be found in resources/proxies3.csv). See the structure in resources/proxies2.csv: connection,port,latency,uptime,location_index. The last three values are used to rank the proxies; if you don't have this information, use 1 as a default value. (A proxy-loading sketch follows this list.)
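
To illustrate the proxy file format described above, here is a hedged sketch of loading and ranking such a CSV (the scraper's actual weighting of latency, uptime, and location index may differ):

```python
import csv

def load_ranked_proxies(path="resources/proxies.csv"):
    """Read rows in the format connection,port,latency,uptime,location_index
    and return them ordered from best to worst (illustrative ranking only)."""
    proxies = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 5:
                continue  # skip malformed or header lines
            connection, port, latency, uptime, location_index = row
            proxies.append({
                "connection": connection,
                "port": int(port),
                "latency": float(latency),
                "uptime": float(uptime),
                "location_index": float(location_index),
            })
    # Assumption: lower latency and higher uptime are better; location_index breaks ties
    proxies.sort(key=lambda p: (p["latency"], -p["uptime"], p["location_index"]))
    return proxies
```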

Optional Configuration

  • If you want, you can configure a backup database connection in case something happens to the primary one. Otherwise, leave the backup connection out.
  • Configure variables such as SLEEP_TIME, LONG_SLEEP_TIME, or USERMAX. We advise against changing these variables too much, as the base configuration has worked well (an illustrative snippet follows this list).
  • SLEEP_TIME should not be below 5 (seconds)
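
Purely as an illustration (the real defaults live in the repository's configuration, and the exact meaning of USERMAX is defined there), the variables mentioned above might be set like this:

```python
# Illustrative values only; consult the repository's configuration for the real defaults.
SLEEP_TIME = 5                # seconds between profiles; should not go below 5
LONG_SLEEP_TIME = (300, 900)  # hypothetical min/max seconds for the long sleeping phase
USERMAX = 100                 # hypothetical limit; see the repository for its exact meaning
```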

Log system

The scraper has two kinds of logs: stdout (printed to the console) and the main logger, which records the status and the time to complete after a profile has been scraped.

This is how the log.csv messages are structured: function status | user id | layer | current time | time to complete. The status and current time are always present.
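
As an example, a hedged sketch of splitting one such line into named fields (the example line and the delimiter are assumptions; if log.csv is comma-separated rather than pipe-separated, adjust the delimiter accordingly):

```python
def parse_log_line(line: str) -> dict:
    # Split a "a | b | c | d | e" style line into the fields described above
    fields = [part.strip() for part in line.split("|")]
    keys = ["function_status", "user_id", "layer", "current_time", "time_to_complete"]
    return dict(zip(keys, fields))

example = "scrape_user ok | 123456789 | 1 | 2023-01-01 12:00:00 | 4.2"  # hypothetical line
print(parse_log_line(example))
```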

Inquiries and suggestions: davide.wiest2@gmail.com

I am a busy Python (web) developer who can work on a freelance basis.