<a href="https://colab.research.google.com/github/Eitams/NLP_Exercise/blob/main/NLP_extract_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text analysis - Bacteria vs. Bacteriophage
The following two-part notebook is part of an NLP course exercise that analyses data extracted from Wikipedia using text-analysis parameters.  
In the first notebook the data is extracted from Wikipedia; in the second, the two extracted data classes are explored.

## Intro
Bacteria are small single-celled microorganisms, which were among the first life forms to appear on Earth and are present in most of its habitats. Though some bacteria are pathogenic to humans, most are not, and many occupy different sites in the human body and perform essential roles in its maintenance. An often-quoted estimate holds that the human body contains about 10 times more bacterial cells than human cells, although more recent estimates put the ratio closer to 1:1.  
  
The natural enemies of bacteria in the evolutionary race are bacteriophages. A bacteriophage (often abbreviated to “phage”) is a virus that infects and replicates within bacteria. 

Notebooks author: Eitam Shafran
## Part 1 - Extracting data from Wikipedia

In [None]:
## Install wikipedia api
!pip install wikipedia-api

Collecting wikipedia-api
  Downloading Wikipedia-API-0.5.4.tar.gz (18 kB)
Building wheels for collected packages: wikipedia-api
  Building wheel for wikipedia-api (setup.py) ... done
  Created wheel for wikipedia-api: filename=Wikipedia_API-0.5.4-py3-none-any.whl size=13477 sha256=bcac39e72ec103e14db0d48fdbfe41c1ab193937cf164d96af90ae8becb5a7ae
  Stored in directory: /root/.cache/pip/wheels/d3/24/56/58ba93cf78be162451144e7a9889603f437976ef1ae7013d04
Successfully built wikipedia-api
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.5.4


In [None]:
## Mount drive
from google.colab import drive
drive.mount('/content/drive')
#import os
#os.chdir('/content/drive/MyDrive/NLP')

Mounted at /content/drive


In [None]:
import wikipediaapi
import pandas as pd

## Set up environment
pd.set_option('display.max_columns', None)
#pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

In [None]:
# Define language and page
wiki = wikipediaapi.Wikipedia('en')

# collect wiki page data in a variable 
bacteria = wiki.page("List_of_clinically_important_bacteria")
bacteriophage = wiki.page("Category:Bacteriophages")
# viruses = wiki.page("List_of_virus_species") ## Was not used in the final dataframe

In [None]:
import re

def bacteria_link_collector(category):
  '''
  Extract links from the Wikipedia bacteria list page.
  The function separates the links to bacteria pages from the other links on the page
  and returns them as a dictionary mapping each page title to its summary.
  *The page is a list article, not a category.
  '''
  ## On this page the bacteria are listed under single-letter section titles;
  ## the text of those sections gives the names of the wanted pages
  category_names = set()
  pattern = re.compile("[A-Za-z]")
  for section in category.sections:
    if pattern.fullmatch(section.title): ## section title is a single letter
      category_names.update(section.text.split('\n')) ## names listed under the section

  ## Keep the summary of every link that is named in category_names and has a non-empty summary
  mdict = {}
  for c in category.links.values():
    ## ns == 0 keeps only main-namespace articles
    if c.ns == 0 and c.title in category_names and c.summary:
      mdict[c.title] = c.summary ## Assign to dict
  return mdict
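
The section-title filter used by `bacteria_link_collector` can be tried without network access. The following is a minimal sketch, assuming hypothetical stand-in objects that mimic only the `.title` and `.text` attributes of the real section objects; the helper name `names_under_letter_sections` and the sample sections are illustrative and not part of the notebook.

```python
import re
from types import SimpleNamespace

# Hypothetical stand-ins for the section objects returned by wikipedia-api;
# only the .title and .text attributes are mimicked here.
sections = [
    SimpleNamespace(title="A", text="Acinetobacter baumannii\nActinomyces israelii"),
    SimpleNamespace(title="B", text="Bacillus anthracis"),
    SimpleNamespace(title="See also", text="List of human diseases"),
]

single_letter = re.compile("[A-Za-z]")

def names_under_letter_sections(sections):
    """Collect the lines listed under sections whose title is a single letter."""
    names = []
    for section in sections:
        # fullmatch succeeds only when the whole title is exactly one letter
        if single_letter.fullmatch(section.title):
            names.extend(section.text.split("\n"))
    return names

species = names_under_letter_sections(sections)
print(species)
```

The "See also" section is skipped because its title is longer than one letter, so only names listed under the alphabetical sections are kept.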

In [None]:
# Get the members of a category together with a short description
def Bacteriophage_collector(category):
  mdict = {}
  categorymembers = category.categorymembers
  for c in categorymembers.values():
    if c.ns == 0: # Keep articles only; exclude sub-categories within the category
      mdict[c.title] = c.summary
  return mdict
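
The `c.ns == 0` test relies on MediaWiki namespace numbers: ordinary articles live in namespace 0 and category pages in namespace 14. Below is a small sketch of that filter on hypothetical stand-in objects; the member titles, the placeholder summaries, and the helper `articles_only` are assumptions made for illustration.

```python
from types import SimpleNamespace

MAIN_NAMESPACE = 0       # ordinary articles
CATEGORY_NAMESPACE = 14  # category pages (MediaWiki numbering)

# Hypothetical category members; real ones would come from category.categorymembers
members = {
    "Bacteriophage MS2": SimpleNamespace(
        title="Bacteriophage MS2", ns=MAIN_NAMESPACE, summary="Placeholder summary."
    ),
    "Category:Example subcategory": SimpleNamespace(
        title="Category:Example subcategory", ns=CATEGORY_NAMESPACE, summary=""
    ),
}

def articles_only(members):
    """Map title -> summary for members in the main namespace only."""
    return {
        page.title: page.summary
        for page in members.values()
        if page.ns == MAIN_NAMESPACE
    }

articles = articles_only(members)
print(articles)
```

Only the article survives the filter; the nested category is dropped, which matches the intent of the comment in `Bacteriophage_collector`.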

In [None]:
# Get all links on a list page together with a short description
def virus_link_collector(category):
  mdict = {}
  links = category.links
  for c in links.values():
    if c.ns == 0 and c.summary: # Keep main-namespace articles with a non-empty summary
      mdict[c.title] = c.summary
  return mdict


In [None]:
bacteria_dic = bacteria_link_collector(bacteria)
bacteriophage_dic = Bacteriophage_collector(bacteriophage)
#virus_dic = virus_link_collector(viruses)

In [None]:
# Transform dictionaries to dataframes 

# Bacteria: one row per page, with the title and summary as columns
bacteria_df = pd.DataFrame(list(bacteria_dic.items()), columns=['Name', 'Description'])
bacteria_df_style = bacteria_df.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

# Bacteriophage
bacteriophage_df = pd.DataFrame(list(bacteriophage_dic.items()), columns=['Name', 'Description'])
bacteriophage_df_style = bacteriophage_df.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

# Viruses (not used in the final dataframe)
#viruses_df = pd.DataFrame(list(virus_dic.items()), columns=['Name', 'Description'])
#viruses_df_style = viruses_df.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

In [None]:
bacteria_df.head()

Unnamed: 0,Name,Description
0,Frateuria aurantia,Frateuria aurantia is a species of bacteria. It is named after the Belgian microbiologist Joseph Frateur. The cells are mostly straight rods. Frateuria aurantia was isolated from the plant Lilium auratum and from the fruit of the raspberry Rubus parvifolius. It is a potassium solubilizing bacteria.
1,Acinetobacter baumannii,"Acinetobacter baumannii is a typically short, almost round, rod-shaped (coccobacillus) Gram-negative bacterium. It is named after the bacteriologist Paul Baumann. It can be an opportunistic pathogen in humans, affecting people with compromised immune systems, and is becoming increasingly important as a hospital-derived (nosocomial) infection. While other species of the genus Acinetobacter are often found in soil samples (leading to the common misconception that A. baumannii is a soil organism, too), it is almost exclusively isolated from hospital environments. Although occasionally it has been found in environmental soil and water samples, its natural habitat is still not known.\nBacteria of this genus lack flagella, whip-like structures many bacteria use for locomotion, but exhibit twitching or swarming motility. This may be due to the activity of type IV pili, pole-like structures that can be extended and retracted. Motility in A. baumannii may also be due to the excretion of exopolysaccharide, creating a film of high-molecular-weight sugar chains behind the bacterium to move forward. Clinical microbiologists typically differentiate members of the genus Acinetobacter from other Moraxellaceae by performing an oxidase test, as Acinetobacter spp. are the only members of the Moraxellaceae to lack cytochrome c oxidases.A. baumannii is part of the ACB complex (A. baumannii, A. calcoaceticus, and Acinetobacter genomic species 13TU). It is difficult to determine the specific species of members of the ACB complex and they comprise the most clinically relevant members of the genus. A. baumannii has also been identified as an ESKAPE pathogen (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter species), a group of pathogens with a high rate of antibiotic resistance that are responsible for the majority of nosocomial infections.Colloquially, A. baumannii is referred to as ""Iraqibacter"" due to its seemingly sudden emergence in military treatment facilities during the Iraq War. It has continued to be an issue for veterans and soldiers who served in Iraq and Afghanistan. Multidrug-resistant A. baumannii has spread to civilian hospitals in part due to the transport of infected soldiers through multiple medical facilities. During the COVID-19 pandemic, coinfection with A. baumannii secondary to SARS-CoV-2 infections has been reported multiple times in literature."
2,Actinomyces israelii,"Actinomyces israelii is a species of Gram-positive, rod-shaped bacteria within the genus Actinomyces. Known to live commensally on and within humans, A. israelii is an opportunistic pathogen and a cause of actinomycosis. Many physiologically diverse strains of the species are known to exist, though not all are strict anaerobes. It was named after the German surgeon James Adolf Israel (1848–1926), who studied the organism for the first time in 1878."
3,Agrobacterium tumefaciens,"Agrobacterium radiobacter (more commonly known as Agrobacterium tumefaciens) is the causal agent of crown gall disease (the formation of tumours) in over 140 species of eudicots. It is a rod-shaped, Gram-negative soil bacterium. Symptoms are caused by the insertion of a small segment of DNA (known as the T-DNA, for 'transfer DNA', not to be confused with tRNA that transfers amino acids during protein synthesis), from a plasmid into the plant cell, which is incorporated at a semi-random location into the plant genome. Plant genomes can be engineered by use of Agrobacterium for the delivery of sequences hosted in T-DNA binary vectors.\nAgrobacterium tumefaciens is an Alphaproteobacterium of the family Rhizobiaceae, which includes the nitrogen-fixing legume symbionts. Unlike the nitrogen-fixing symbionts, tumor-producing Agrobacterium species are pathogenic and do not benefit the plant. The wide variety of plants affected by Agrobacterium makes it of great concern to the agriculture industry.Economically, A. tumefaciens is a serious pathogen of walnuts, grape vines, stone fruits, nut trees, sugar beets, horse radish, and rhubarb, and the persistent nature of the tumors or galls caused by the disease make it particularly harmful for perennial crops.Agrobacterium tumefaciens grows optimally at 28 °C. The doubling time can range from 2.5–4h depending on the media, culture format, and level of aeration. At temperatures above 30 °C, A. tumefaciens begins to experience heat shock which is likely to result in errors in cell division."
4,Anaplasma,"Anaplasma is a genus of bacteria of the alphaproteobacterial order Rickettsiales, family Anaplasmataceae.\nAnaplasma species reside in host blood cells and lead to the disease anaplasmosis. The disease most commonly occurs in areas where competent tick vectors are indigenous, including tropical and semitropical areas of the world for intraerythrocytic Anaplasma spp.Anaplasma species are biologically transmitted by Ixodes deer-tick vectors, and the prototypical species, A. marginale, can be mechanically transmitted by biting flies and iatrogenically with blood-contaminated instruments. One of the major consequences of infection by bovine red blood cells by A. marginale is the development of nonhaemolytic anaemia, thus the absence of hemoglobinuria, which allows clinical differentiation from another major tick-borne disease, bovine babesiosis, caused by Babesia bigemina.Species of veterinary interest include:\n\nAnaplasma marginale and Anaplasma centrale in cattle\nAnaplasma ovis and Anaplasma mesaeterum in sheep and goats\nAnaplasma phagocytophilum in dogs, cats, and horses (see human granulocytic anaplasmosis)\nAnaplasma platys in dogs"


In [None]:
bacteriophage_df.head()

Unnamed: 0,Name,Description
0,Bacteriophage,"A bacteriophage (), also known informally as a phage (), is a virus that infects and replicates within bacteria and archaea. The term was derived from ""bacteria"" and the Greek φαγεῖν (phagein), meaning ""to devour"". Bacteriophages are composed of proteins that encapsulate a DNA or RNA genome, and may have structures that are either simple or elaborate. Their genomes may encode as few as four genes (e.g. MS2) and as many as hundreds of genes. Phages replicate within the bacterium following the injection of their genome into its cytoplasm.\nBacteriophages are among the most common and diverse entities in the biosphere. Bacteriophages are ubiquitous viruses, found wherever bacteria exist. It is estimated there are more than 1031 bacteriophages on the planet, more than every other organism on Earth, including bacteria, combined. Viruses are the most abundant biological entity in the water column of the world's oceans, and the second largest component of biomass after prokaryotes, where up to 9x108 virions per millilitre have been found in microbial mats at the surface, and up to 70% of marine bacteria may be infected by phages.Phages have been used since the late 20th century as an alternative to antibiotics in the former Soviet Union and Central Europe, as well as in France. They are seen as a possible therapy against multi-drug-resistant strains of many bacteria (see phage therapy).\nPhages are known to interact with the immune system both indirectly via bacterial expression of phage-encoded proteins and directly by influencing innate immunity and bacterial clearance."
1,Bacillus virus AP50,"Bacillus virus AP50 is a species of bacteriophage that infects Bacillus anthracis bacteria. Originally thought to be an RNA phage, it contains a DNA genome of about 14,000 base pairs in an icosahedral capsid with a two-layer capsid shell.\n\n\n== References =="
2,Bacteriophage AP205,Bacteriophage AP205 is a bacteriophage that infects Acinetobacter bacteria. Contains a genome linear of positive single-stranded RNA. The bacteriophage belongs to the genus Apeevirus of the Duinviridae family and is the type species of the family.\n\n\n== References ==
3,Bacteriophage f2,"Bacteriophage f2 is an icosahedral, positive-sense single-stranded RNA virus that infects the bacterium Escherichia coli. It is closely related to bacteriophage MS2 and assigned to the same species."
4,Bacteriophage MS2,"Bacteriophage MS2 (Emesvirus zinderi), commonly called MS2, is an icosahedral, positive-sense single-stranded RNA virus that infects the bacterium Escherichia coli and other members of the Enterobacteriaceae. MS2 is a member of a family of closely related bacterial viruses that includes bacteriophage f2, bacteriophage Qβ, R17, and GA."
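
Some of the summaries shown above end with wiki heading residue such as `== References ==` and contain raw line breaks. The sketch below shows one possible way to tidy such strings before analysis. It is an optional step that is not part of the original notebook; the helper `clean_summary` and its regular expression are assumptions, and the sample string is shortened from the output above.

```python
import re

# Matches the first wiki-style heading (e.g. "== References ==") together with
# any whitespace before it and everything that follows it.
TRAILING_HEADING = re.compile(r"\s*==+[^=]+==+.*\Z", re.DOTALL)

def clean_summary(text):
    """Drop heading residue and collapse line breaks into single spaces."""
    text = TRAILING_HEADING.sub("", text)
    return re.sub(r"\s*\n+\s*", " ", text).strip()

raw = "Bacillus virus AP50 is a species of bacteriophage.\n\n\n== References =="
print(clean_summary(raw))
```

Because everything from the first heading onward is discarded, this sketch only suits short summaries where the heading is trailing residue; it would need adjusting for full page text.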


In [None]:
## Save df as csv on drive
#bacteria_df.to_csv('bacteria_df.csv', encoding = 'utf-8-sig') 
#bacteriophage_df.to_csv('bacteriophage_df.csv', encoding = 'utf-8-sig') 
#viruses_df.to_csv('viruses_df.csv', encoding = 'utf-8-sig') 
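
The commented-out calls above use `encoding = 'utf-8-sig'`, which prefixes the file with a UTF-8 byte-order mark (BOM). Spreadsheet programs such as Excel commonly use that marker to recognise UTF-8, which matters here because the descriptions contain non-ASCII text. The following standard-library sketch illustrates the difference from plain `utf-8`; the rows and the helper `encode_csv` are hypothetical.

```python
import csv
import io

# Hypothetical rows containing non-ASCII characters
rows = [("Name", "Description"), ("Bacteriophage Qβ", "Placeholder text with φαγεῖν")]

def encode_csv(rows, encoding):
    """Serialise rows as CSV text, then encode the text with the given codec."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)

with_bom = encode_csv(rows, "utf-8-sig")
without_bom = encode_csv(rows, "utf-8")

print(with_bom[:3])                 # the three BOM bytes
print(with_bom[3:] == without_bom)  # the remaining bytes are identical
```

The only difference between the two encodings is the three leading bytes, so a file written with `utf-8-sig` can still be read back as UTF-8 by tools that skip the BOM.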


# Link to Part 2 - Exploratory Data Analysis Colab notebook
https://colab.research.google.com/drive/1KiEijOrDQG1gGiNWxhvXOBvrgm1URtXO?usp=sharing