# Task Objective:
Design and implement an Information Retrieval (IR) system capable of efficiently retrieving relevant documents from a given dataset. The key steps and details of the project are listed below.

# Methodology

### 1. Dataset Selection and Preparation
### 2. Data Preprocessing
### 3. User Query Interface
### 4. Retrieval Algorithm
### 5. Query Processing
### 6. Evaluation
### 7. User Feedback (Optional)
### 8. Documentation and Presentation


# 1. Dataset Selection and Preparation:
  - The dataset was scraped from the https://www.thenews.com.pk website. It contains business and sports news articles from 2015 to date. Each record holds the article's heading, its content, and its date. The content also includes the place from which the statement or article was published.
  - You can download the dataset from Kaggle (https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles).
  
### About this CSV file
The dataset contains 4 columns:

- Article: Text of the news article, including the place it was published from.
- Heading: Text containing the heading of the news article.
- Date: Date when the article was published.
- NewsType: Type of article, i.e. business or sports.

## Importing Libraries 

In [70]:
## Importing necessary libraries
import pandas as pd
import numpy as np
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

In [71]:
## Load the dataset into a DataFrame using pandas
df = pd.read_csv("./dataset/Articles.csv", encoding='latin1')
df.head()

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business


In [72]:
## Keep only the necessary columns
data = df.drop(["Date", "Heading"], axis=1)
data.head()

Unnamed: 0,Article,NewsType
0,KARACHI: The Sindh government has decided to b...,business
1,HONG KONG: Asian markets started 2015 on an up...,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,business
4,NEW YORK: US oil prices Monday slipped below $...,business


# 2. Data Cleaning & Preprocessing
- Check for missing values
- Remove duplicate text
- Normalize casing
- Remove punctuation
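
The steps listed above can be tried on a toy list of strings before running them on the full dataset. This is a minimal plain-Python sketch, not the notebook's pandas code; `raw_articles` and `clean_articles` are hypothetical names introduced only for this illustration.

```python
import string

# Hypothetical toy input: one missing value and one exact duplicate.
raw_articles = [
    "KARACHI: Oil prices fell.",
    None,
    "KARACHI: Oil prices fell.",
    "Hong Kong shares opened lower!",
]

def clean_articles(articles):
    """Drop missing values and exact duplicates, lowercase, strip punctuation."""
    # This translation table deletes every ASCII punctuation character.
    strip_punct = str.maketrans("", "", string.punctuation)
    seen = set()
    cleaned = []
    for text in articles:
        if text is None:      # 1. skip missing values
            continue
        if text in seen:      # 2. skip exact duplicates (first occurrence is kept)
            continue
        seen.add(text)
        text = text.lower()   # 3. normalize casing
        cleaned.append(text.translate(strip_punct))  # 4. remove punctuation
    return cleaned

print(clean_articles(raw_articles))
```

Duplicates are detected on the raw text here, which mirrors dropping duplicates before any text normalization takes place.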

In [73]:
## Check for missing values
data.isna().sum()

Article     0
NewsType    0
dtype: int64

In [74]:
## Check for duplicates
print("Duplicates ",data.duplicated().sum())

data.drop_duplicates(inplace = True) ## drop the duplicate

print("After dropping duplicates" , data.duplicated().sum())

Duplicates  108
After dropping duplicates 0


In [75]:
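## Check for duplicates in the Article column only
## (this result is not displayed: a cell echoes only its last expression)
data.duplicated(subset = ["Article"]).sum()

## Download the NLTK stopword list used by the preprocessing function
import nltk
nltk.download('stopwords')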

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Barcha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [76]:
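## Create a preprocessing function
## Note: WordNetLemmatizer needs the NLTK 'wordnet' corpus (nltk.download('wordnet'))
stops_word = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()  # created once and reused for every article

def preprocessing(text):
    text = text.lower()
    # Remove URLs (http://, https:// and bare www. links)
    text = re.sub(r"http\S+|www\.\S+", '', text)
    # Remove @mentions and the '#' of hashtags
    text = re.sub(r'@\w+|#', '', text)
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)

    # Lemmatize each word as a verb, then drop English stopwords
    words = text.split()
    lemmas = [lemmatizer.lemmatize(word, wordnet.VERB) for word in words]
    kept_words = [word for word in lemmas if word not in stops_word]

    return " ".join(kept_words)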
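
The regex steps of the preprocessing function can be checked in isolation, without NLTK. The sketch below is an illustration under assumptions: `strip_noise`, the three pattern names, and the sample sentence (including `www.example.com` and `@reporter`) are hypothetical and not part of the notebook. It also shows why the backslash in `\w+` matters: without it, `w+` matches only literal `w` characters.

```python
import re

URL_PATTERN = re.compile(r"http\S+|www\.\S+")  # http://, https:// and bare www. links
MENTION_PATTERN = re.compile(r"@\w+|#")         # whole @mentions; only the '#' of a hashtag
PUNCT_PATTERN = re.compile(r"[^\w\s]")          # anything that is not a word char or whitespace

def strip_noise(text):
    """Lowercase the text, then remove URLs, mentions, and punctuation."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    text = PUNCT_PATTERN.sub("", text)
    # Collapse the gaps left behind by the removed pieces.
    return " ".join(text.split())

sample = "Read more at https://www.thenews.com.pk or www.example.com, says @reporter #Business!"
print(strip_noise(sample))

# A pattern missing the backslash leaves an ordinary mention untouched.
print(re.sub(r"@w+", "", "@reporter"))
```

URLs are removed before punctuation on purpose: once the dots and slashes are gone, a URL can no longer be recognized as one.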

    

In [77]:
## Apply preprocessing
data["Article"] = data["Article"].apply(preprocessing)

In [80]:
# check the preprocessed data
data["Article"]

0       karachi sindh government decide bring public t...
1       hong kong asian market start 2015 upswing limi...
2       hong kong hong kong share open 066 percent low...
3       hong kong asian market tumble tuesday follow p...
4       new york us oil price monday slip 50 barrel fi...
                              ...                        
2669    strongdubai dubai international airport flag c...
2670    strongbeijing former prime minister shaukat az...
2671    strongwashington uber ground fleet selfdriving...
2690    strongbeijing new development bank plan cofina...
2691    strongkarachi karachibased technology incubato...
Name: Article, Length: 2584, dtype: object
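
The output above contains tokens such as `066`, `selfdriving`, and `karachibased`: deleting punctuation outright fuses the text on either side of it. One possible alternative, shown here only as a sketch and not as the notebook's method, is to replace punctuation with a space while keeping a dot that sits between two digits. The function names and the sample sentence are hypothetical.

```python
import re

def delete_punctuation(text):
    # Same idea as the notebook's punctuation step: drop the character entirely.
    return re.sub(r"[^\w\s]", "", text)

def space_out_punctuation(text):
    # Replace punctuation with a space, except a '.' that has a digit on both
    # sides (a decimal point), then collapse runs of whitespace.
    text = re.sub(r"[^\w\s.]|\.(?!\d)|(?<!\d)\.", " ", text)
    return " ".join(text.split())

sample = "hong kong shares opened 0.66 percent lower. karachi-based self-driving fleet"

print(delete_punctuation(sample))
print(space_out_punctuation(sample))
```

Whether the second behavior is preferable depends on the retrieval step: it yields more, shorter tokens, and the same function would have to be applied to both the articles and the queries.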

In [85]:
data["NewsType"].value_counts()

sports      1408
business    1176
Name: NewsType, dtype: int64

# 3. User Query Interface
  - Create a query function that reads the user's text and cleans it with the preprocessing function defined above.

In [88]:
## Create an input query function

def query():
    # Use a distinct local name so the function name "query" is not shadowed
    user_query = input("Write the query(Text) : ")
    print("The query before preprocessing: " ,user_query)
    user_query = preprocessing(user_query)
    
    print("\nThe query after preprocessing: " ,user_query)
    
    return user_query

In [89]:
query()

Write the query(Text) : As a writer, this phase serves as the most exciting phase, as everything is already laid down carefully and you get to concretize the things you have already laid down. This phase is indeed crucial for the things you have organized prior to the writing phase.   As a golden rule, always remember to write as you think. Get to be friends with your mind as this plays a huge role in the writing phase. Remember to be as dynamic as you can but keep referencing the base document to see the sub-points and the information you want under each point. You can play words, but always make sure that you pitch the attention of the readers by making use of relevant and accurate pieces of information.  There may be many drafts in your article, it is always important that you aim for effective positioning to readers and comprehensive dissemination of information.  In the actual writing phase, make sure to do these things:  Write a convincing opening paragraph. Making an engaging op

'writer phase serve excite phase everything already lay carefully get concretize things already lay phase indeed crucial things organize prior write phase golden rule always remember write think get friends mind play huge role write phase remember dynamic keep reference base document see subpoints information want point play word always make sure pitch attention readers make use relevant accurate piece information may many draft article always important aim effective position readers comprehensive dissemination information actual write phase make sure things write convince open paragraph make engage open paragraph attract attention readers start engage statement everything follow open statement important set attention readers include keywords use remarkable keywords retain attention readers also effectively let readers know aim urge readers take action may conclude article compel call action urge respond leave comment ask question tell thoughts upon read content write active voice mini
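
Because `query()` reads from `input()`, it is awkward to call from tests or from later retrieval code. A possible variant, sketched here under assumptions, accepts the text as an optional argument and prompts only when none is given. `get_query` and `lowercase_only` are hypothetical names; `lowercase_only` merely stands in for the notebook's preprocessing function so that the example runs without NLTK.

```python
def get_query(preprocess, text=None):
    """Return the preprocessed query; prompt the user only when no text is given."""
    if text is None:
        text = input("Write the query (text): ")
    text = text.strip()
    if not text:
        # An empty query cannot be matched against any document.
        raise ValueError("The query is empty.")
    return preprocess(text)

# Stand-in for the notebook's preprocessing function (lowercasing only).
def lowercase_only(text):
    return text.lower()

print(get_query(lowercase_only, "  Asian Markets Tumble  "))
```

Passing the preprocessing function as a parameter keeps the query step tied to whatever cleaning was applied to the articles.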

#  4. Retrieval Algorithm