NLP and Web Scraping Lab by Omar Nouih

This repository demonstrates web scraping and a Natural Language Processing (NLP) pipeline for Arabic-language text. The pipeline covers text cleaning, tokenization, stop-word removal, discretization, normalization, stemming, lemmatization, part-of-speech tagging, and Named Entity Recognition (NER).

Objective

The main objective of this project is to demonstrate web scraping and NLP techniques for Arabic-language text. It scrapes data from a website about the census process in Morocco and applies a series of NLP tasks to that data, covering data acquisition, preprocessing, and analysis specific to Arabic.

Tasks

Web scraping from www.candidature-recensement.ma to extract information about census conditions, steps, compensation, tasks, and frequently asked questions.
Storing the raw data in MongoDB for further processing and retrieval.
Preprocessing the Arabic text through tokenization, stop-word removal, normalization, stemming, lemmatization, and other techniques.
Performing part-of-speech tagging and Named Entity Recognition (NER) on the Arabic text using the Farasa library.

Libraries Used

requests: For sending HTTP requests to the website and fetching the HTML content.
Beautiful Soup: For parsing the HTML content and extracting relevant data from the website.
pymongo: For interacting with the MongoDB database to store and retrieve the scraped data.
nltk: For NLP tasks such as tokenization, stop-word removal, stemming, and lemmatization.
qalsadi: For Arabic lemmatization.
farasa: For Arabic language processing tasks, including part-of-speech tagging and Named Entity Recognition (NER).

Web Scraping

We used the Python libraries listed above to scrape data from www.candidature-recensement.ma: requests fetches the HTML and Beautiful Soup parses it. The scraped data covers census conditions, steps, compensation, tasks, and frequently asked questions.
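The fetch-and-parse step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the https scheme, the function names, and the choice of tags to extract are all assumptions, and the real page may need more specific selectors.

```python
import re

# Site named in this README; the https scheme is an assumption.
BASE_URL = "https://www.candidature-recensement.ma"


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    import requests  # third-party; imported lazily so the pure helper works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    # Guess the encoding from the body, which helps when no charset header is sent.
    response.encoding = response.apparent_encoding
    return response.text


def extract_text_blocks(html: str) -> list[str]:
    """Return the visible text of headings, paragraphs, and list items."""
    from bs4 import BeautifulSoup  # third-party (beautifulsoup4)

    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    # Hypothetical tag selection, chosen only for illustration.
    for tag in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        text = collapse_whitespace(tag.get_text(" "))
        if text:
            blocks.append(text)
    return blocks
```

Calling `extract_text_blocks(fetch_html(BASE_URL))` would then yield a list of text fragments that can be grouped by topic (conditions, steps, compensation, tasks, FAQ) before storage.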

Storing Raw Data

We stored the raw scraped data in the "atelier" collection of a MongoDB database named "NLP". This allows for easy retrieval and further processing of the data.
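A storage step along these lines might look like the sketch below. The database and collection names come from this README; the connection URI (a local MongoDB instance), the document fields, and the function names are assumptions made for illustration.

```python
from datetime import datetime, timezone


def build_documents(sections: dict[str, list[str]], source_url: str) -> list[dict]:
    """Turn {section name: [texts]} into one MongoDB document per non-empty text."""
    scraped_at = datetime.now(timezone.utc)
    return [
        # Hypothetical schema: the field names are illustrative only.
        {"section": name, "text": text, "source": source_url, "scraped_at": scraped_at}
        for name, texts in sections.items()
        for text in texts
        if text.strip()
    ]


def store_documents(docs: list[dict], uri: str = "mongodb://localhost:27017") -> int:
    """Insert the documents into NLP.atelier and return how many were written."""
    from pymongo import MongoClient  # third-party; imported lazily

    if not docs:
        return 0  # insert_many rejects an empty list
    client = MongoClient(uri)  # assumed local instance
    try:
        collection = client["NLP"]["atelier"]
        result = collection.insert_many(docs)
        return len(result.inserted_ids)
    finally:
        client.close()
```

Keeping the document construction separate from the database call makes it possible to inspect the records before anything is written.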

NLP Pipeline

The NLP pipeline applies several preprocessing steps to the scraped Arabic text to prepare it for analysis: tokenization, stop-word removal, normalization, stemming, and lemmatization. We then performed part-of-speech tagging and Named Entity Recognition (NER) using the Farasa library, which is designed specifically for Arabic language processing.
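The early preprocessing steps can be illustrated with a short sketch. It uses only the standard library for normalization, cleaning, tokenization, and stop-word removal, as a stand-in for the library calls in the notebook. The normalization rules and the tiny stop-word list are assumptions chosen for illustration. The optional stemming helper uses NLTK's ISRI stemmer, which may or may not be the stemmer used in the notebook. Lemmatization, part-of-speech tagging, and NER are not reproduced here.

```python
import re

# Arabic diacritics (U+064B to U+0652) plus the tatweel elongation mark (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is neither an Arabic letter (U+0621 to U+064A) nor whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def normalize(text: str) -> str:
    """Strip diacritics and unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return text


# Small illustrative stop-word list (an assumption), stored in normalized form.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن", "هذا", "هذه")}


def clean(text: str) -> str:
    """Replace digits, punctuation, and non-Arabic characters with spaces."""
    return NON_ARABIC.sub(" ", text)


def preprocess(text: str) -> list[str]:
    """Normalize, clean, tokenize on whitespace, and drop stop words."""
    tokens = clean(normalize(text)).split()
    return [token for token in tokens if token not in STOP_WORDS]


def stem_tokens(tokens: list[str]) -> list[str]:
    """Optionally reduce tokens to stems with NLTK's ISRI stemmer."""
    from nltk.stem.isri import ISRIStemmer  # third-party; imported lazily

    stemmer = ISRIStemmer()
    return [stemmer.stem(token) for token in tokens]
```

Normalization runs before cleaning on purpose: diacritics fall outside the Arabic letter range, so removing them first avoids splitting a vocalized word into fragments.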
