# Task Objective:
Your mission is to design and implement an Information Retrieval (IR) system that efficiently retrieves relevant documents from a given dataset. The key steps and details of the project are listed below.

# Methodology

### 1. Dataset Selection and Preparation
### 2. Data Preprocessing
### 3. User Query Interface
### 4. Retrieval Algorithm
### 5. Query Processing
### 6. Evaluation
### 7. User Feedback (Optional)
### 8. Documentation and Presentation


# 1. Dataset Selection and Preparation:
  - This dataset was scraped from the https://www.thenews.com.pk website. It contains business and sports news articles from 2015 onward. Each record holds the heading of the article, its content, and its date. The content also includes the place from which the statement or article was published.
  - You can download this dataset from Kaggle (https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles).
  
### About this CSV file
  The dataset contains 4 columns:

  - Article : Text of the news article, including the place it was published from.
  - Heading : Text containing the heading of the news article.
  - Date : Date when the article was published.
  - NewsType : Type of article, i.e. business or sports.

## Importing Libraries 

In [64]:
## Importing necessary libraries
import pandas as pd
import numpy as np
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords ,wordnet
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt

In [65]:
## Import the dataframe using pandas 
df = pd.read_csv("./dataset/Articles.csv", encoding='latin1')
df.head()

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business


In [66]:
## Keep only the columns that are needed
data = df.drop(["Date" ,"Heading"] ,axis = 1)
data.head()

Unnamed: 0,Article,NewsType
0,KARACHI: The Sindh government has decided to b...,business
1,HONG KONG: Asian markets started 2015 on an up...,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,business
4,NEW YORK: US oil prices Monday slipped below $...,business


# 2. Data Cleaning & Preprocessing
  - Check for missing values
  - Remove duplicate texts
  - Normalize casing
  - Remove punctuation

In [67]:
## Check for missing values
data.isna().sum()

Article     0
NewsType    0
dtype: int64

In [68]:
## Check for duplicates
print("Duplicates ",data.duplicated().sum())

data.drop_duplicates(inplace = True) ## drop the duplicate

print("After dropping duplicates" , data.duplicated().sum())

Duplicates  108
After dropping duplicates 0


In [69]:
## Count duplicates in the Article column (not displayed: a cell shows only its last expression), then preview the data
data.duplicated(subset = ["Article"]).sum()
data.head()

Unnamed: 0,Article,NewsType
0,KARACHI: The Sindh government has decided to b...,business
1,HONG KONG: Asian markets started 2015 on an up...,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,business
4,NEW YORK: US oil prices Monday slipped below $...,business


In [70]:
data.tail()

Unnamed: 0,Article,NewsType
2669,strong>DUBAI: Dubai International Airport and ...,business
2670,"strong>BEIJING: Former Prime Minister, Shaukat...",business
2671,strong>WASHINGTON: Uber has grounded its fleet...,business
2690,strong>BEIJING: The New Development Bank plans...,business
2691,strong>KARACHI: Karachi-based technology incub...,business


In [71]:
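import nltk
nltk.download('stopwords')  ## stop-word list used by preprocessing
nltk.download('punkt')      ## tokenizer models used by word_tokenize
## WordNetLemmatizer also needs the 'wordnet' corpus; if it is missing, run nltk.download('wordnet')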

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Barcha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Barcha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [72]:
## Create a preprocessing function
stop_words = set(stopwords.words("english"))  ## English stop words to drop
lemmatizer = WordNetLemmatizer()  ## converts a word to its base form

def preprocessing(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)  ## remove URLs
    text = re.sub(r'<br>|</?strong>', '', text)  ## remove well-formed tags while '<' and '>' are still present
    text = re.sub(r'@\w+|#', '', text)  ## remove @mentions and '#' signs
    text = re.sub(r'[^\w\s]', '', text)  ## remove punctuation; malformed residue such as "strong>" is left as "strong"

    words = word_tokenize(text)  ## split the text into words/tokens
    lemmatized = [lemmatizer.lemmatize(word, wordnet.VERB) for word in words]  ## lemmatize each token as a verb
    kept = [word for word in lemmatized if word not in stop_words]  ## drop stop words

    return " ".join(kept)
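
The order of the regular-expression steps in a function like `preprocessing` matters. Once punctuation has been stripped, a tag such as `<strong>` has already become the bare word `strong`, so a tag pattern applied afterwards no longer matches. The sketch below uses only the standard `re` module to show the difference on a made-up sample string. The helper names `punct_then_tags` and `tags_then_punct` are hypothetical and exist only for this illustration.

```python
import re

TAG = re.compile(r"<br>|</?strong>")  # well-formed tags only
PUNCT = re.compile(r"[^\w\s]")        # anything that is neither a word character nor whitespace

def punct_then_tags(text):
    """Strip punctuation first, then try to remove tags (the tags no longer match)."""
    return TAG.sub("", PUNCT.sub("", text.lower()))

def tags_then_punct(text):
    """Remove tags first, then strip punctuation."""
    return PUNCT.sub("", TAG.sub("", text.lower()))

# Made-up sample in the style of the dataset, not an actual row.
sample = "<strong>DUBAI: Dubai International Airport</strong>"

print(punct_then_tags(sample))  # strongdubai dubai international airportstrong
print(tags_then_punct(sample))  # dubai dubai international airport
```

Neither ordering handles malformed residue such as `strong>` with a missing `<`. A pattern for that case would have to be chosen with care so that it does not delete the ordinary word "strong".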

    

In [73]:
## Apply preprocessing
data["ArticleCleaned"] = data["Article"].apply(preprocessing)

In [74]:
# check the preprocessed data
data["ArticleCleaned"]

0       karachi sindh government decide bring public t...
1       hong kong asian market start 2015 upswing limi...
2       hong kong hong kong share open 066 percent low...
3       hong kong asian market tumble tuesday follow p...
4       new york us oil price monday slip 50 barrel fi...
                              ...                        
2669    strongdubai dubai international airport flag c...
2670    strongbeijing former prime minister shaukat az...
2671    strongwashington uber ground fleet selfdriving...
2690    strongbeijing new development bank plan cofina...
2691    strongkarachi karachibased technology incubato...
Name: ArticleCleaned, Length: 2584, dtype: object

In [75]:
data["NewsType"].value_counts()

sports      1408
business    1176
Name: NewsType, dtype: int64

# 3. User Query Interface
  - Create a query function that reads the user's query and cleans it with the `preprocessing` function defined above.

In [76]:
## Create an input query function

def query():
    raw_query = input("Write the query(Text) :\n ")  ## read the query from the user

    cleaned_query = preprocessing(raw_query)  ## clean it the same way as the articles

    return cleaned_query

#  4. Retrieval Algorithm
 - Use cosine similarity between TF-IDF vectors to score every article against the query, then print the most likely news article for it.
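
To make the scoring step concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is an illustration under stated assumptions, not the code the notebook runs. The smoothed idf formula and the L2 normalisation follow the documented defaults of scikit-learn's `TfidfVectorizer`, while whitespace tokenisation is a simplification. The helper names (`tfidf_vectors`, `weigh`, `cosine`) and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Weight raw term counts by idf and scale the vector to unit length."""
    # Terms outside the fitted vocabulary are ignored, as when transforming a query.
    weights = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def tfidf_vectors(docs):
    """Return (idf table, list of unit-length tf-idf vectors) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [weigh(tokens, idf) for tokens in tokenised]

def cosine(a, b):
    """For unit-length sparse vectors, the dot product is the cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Toy documents (hypothetical, already preprocessed).
docs = [
    "sindh government cut transport fare",
    "asian market start year upswing",
    "oil price slip below fifty dollar",
]
idf, doc_vectors = tfidf_vectors(docs)
query_vector = weigh("oil price".split(), idf)

scores = [cosine(query_vector, d) for d in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 2: only the third document shares terms with the query
```

Documents that share no term with the query score exactly 0, which is why sorting by score in descending order puts the overlapping documents first.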

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

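def algo():
    vectorizer = TfidfVectorizer()   ## converts text into TF-IDF weighted vectors

    x = data["ArticleCleaned"]

    x_vectorize = vectorizer.fit_transform(x)  ## learn the vocabulary and vectorize the articles

    query111 = query()

    query_vectorizer = vectorizer.transform([query111])  ## vectorize the query with the same vocabulary

    similarity = cosine_similarity(query_vectorizer, x_vectorize)  ## shape: (1, number of articles)
    return similarity, query111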


## Run the cell below to query

In [81]:
## Search the text for the most relevant match
similarity, query11 = algo()
max_text_index = similarity[0].argmax()

print()
print(f"\nThe index is: {max_text_index}\n")
print(data["Article"].iloc[max_text_index])

Write the query(Text) :
 KARACHI: The Sindh government has decided to


The index is: 871

strong>ISLAMABAD: Sindh Chief Minister Syed Murad Ali Shah has said that World Bank has played an important role in the development of education, health, agriculture and infrastructure sectors in Sindh. "The urge for development of Sindh has brought me here at the WB Country office."</strongThis he said while talking to WB Country Director Mr Illang Patchamuthu at WB Country office where he had a luncheon-meeting with him today.The chief minister said that the total development portfolio of WB in Sindh was $1.14 billion which include Education, health, irrigation, Agriculture, skill development and other sectors.Syed Murad Ali Shah discussed Karachi mega projects which include infrastructure development of the city, water supply and sanitation projects. "Sindh government needs financial and technical assistance for implementation of these projects," he told the World Bank country director.He urge

## First 5 query matches

In [82]:
ranked_indices = np.argsort(similarity[0])[::-1]
 
ranked_documents = [data["Article"].iloc[idx] for idx in ranked_indices]  

In [83]:
for i, news in enumerate(ranked_documents[:5]):  ## show only the five best matches
    print(i)
    print(news)
    print(100*"-")
    print("\n")
    

0
strong>ISLAMABAD: Sindh Chief Minister Syed Murad Ali Shah has said that World Bank has played an important role in the development of education, health, agriculture and infrastructure sectors in Sindh. "The urge for development of Sindh has brought me here at the WB Country office."</strongThis he said while talking to WB Country Director Mr Illang Patchamuthu at WB Country office where he had a luncheon-meeting with him today.The chief minister said that the total development portfolio of WB in Sindh was $1.14 billion which include Education, health, irrigation, Agriculture, skill development and other sectors.Syed Murad Ali Shah discussed Karachi mega projects which include infrastructure development of the city, water supply and sanitation projects. "Sindh government needs financial and technical assistance for implementation of these projects," he told the World Bank country director.He urged the World Bank country head to send his team to Sindh to discuss these projects with th
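
The Evaluation step listed in the Methodology can be approached with precision at k: the fraction of the top k results that are relevant. The sketch below is a minimal illustration that assumes the `NewsType` label can stand in for relevance. That is only a rough proxy, not a true relevance judgement. The function name `precision_at_k` and the sample labels are hypothetical.

```python
def precision_at_k(ranked_labels, relevant_label, k):
    """Fraction of the top-k ranked documents whose label equals relevant_label."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(label == relevant_label for label in top) / len(top)

# Hypothetical labels of a ranked result list, best match first.
ranked_labels = ["business", "business", "sports", "business", "sports"]

print(precision_at_k(ranked_labels, "business", 5))  # 0.6
print(precision_at_k(ranked_labels, "business", 2))  # 1.0
```

With the notebook's variables, the labels of the five best matches could be obtained with `data["NewsType"].iloc[ranked_indices[:5]].tolist()`.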

## Conclusion
  - As the output above shows, the query runs successfully and the related documents are retrieved and displayed.