## Scraping Content from Documents

[This is a folder](https://drive.google.com/file/d/1a_JlM2k_An8CT0MRLYjX6gzYvNm5t3cl/view?usp=share_link) that contains more than two dozen files.

Using the lesson on collecting content from documents, please do the following using Python:

* Analyze ONLY the .txt files (but do not physically remove the other files from this folder).

* Output a CSV file that has 4 columns: year, cognition_related, medical_condition, care_hours

* In the cognition_related column, enter ```True``` if the condition is related to Dementia or Alzheimer's disease, and ```False``` if it is not.

* In the medical_condition column, enter either "Dementia", "Alzheimer's" or "Not Specified", depending on the case.
* In the care_hours column, enter either "half-day" for 12-hour care, "full-day" for 24-hour care or "Not Specified".

* Export the CSV to your downloads.


In [1]:
# import libraries

import pandas as pd
import glob

In [2]:
## Collect and sort the .txt files
all_txts = sorted(glob.glob("project-docs/*.txt"))
all_txts

['project-docs/decision_01.txt',
 'project-docs/decision_02.txt',
 'project-docs/decision_03.txt',
 'project-docs/decision_04.txt',
 'project-docs/decision_05.txt',
 'project-docs/decision_06.txt',
 'project-docs/decision_07.txt',
 'project-docs/decision_08.txt',
 'project-docs/decision_09.txt',
 'project-docs/decision_10.txt',
 'project-docs/decision_11.txt',
 'project-docs/decision_12.txt',
 'project-docs/decision_13.txt',
 'project-docs/decision_14.txt',
 'project-docs/decision_15.txt',
 'project-docs/decision_16.txt',
 'project-docs/decision_17.txt',
 'project-docs/decision_18.txt',
 'project-docs/decision_19.txt',
 'project-docs/decision_20.txt']
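
The `*.txt` pattern is what keeps the other file types out of the analysis without deleting anything from the folder. As a minimal sketch of that filtering, the snippet below builds a throwaway temporary folder with hypothetical file names (not the real `project-docs` contents) and lists only the text files. `list_txt_files` is a helper name introduced here for illustration.

```python
import glob
import os
import tempfile


def list_txt_files(folder):
    """Return the sorted paths of the .txt files directly inside folder."""
    return sorted(glob.glob(os.path.join(folder, "*.txt")))


# Hypothetical file names, created in a temporary folder for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["decision_02.txt", "decision_01.txt", "notes.pdf", "scan.docx"]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder")
    found = [os.path.basename(path) for path in list_txt_files(tmp)]

print(found)  # ['decision_01.txt', 'decision_02.txt']
```

The `.pdf` and `.docx` files stay on disk; they are simply never matched by the pattern.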

In [41]:
## See the content of the files
for text_file in all_txts:
    with open(text_file, "r") as my_text:
        all_text = my_text.readlines()
        print(all_text)

['Year: 2016\n', '\n', 'The appellant was determined to suffer from Dementia.  \n', '\n', 'The order provides 24-hour care.']
['Year: 2017\n', '\n', 'The appellant was determined to not suffer from Dementia.  \n', '\n', 'The order remains for partial care.']
['Year: 2021\n', '\n', "The appellant was determined to suffer from Alzheimer's disease.  \n", '\n', 'This order provides 24-hour care.']
['Year: 2021\n', '\n', 'The appellant was determined to not suffer from Dementia.  \n', '\n', 'The order remains for partial care.']
['Year: 2018\n', '\n', 'The appellant was determined to suffer from Dementia.  \n', '\n', 'This order provides 12-hour care.']
['Year: 2021\n', '\n', 'The appellant was determined to suffer from Dementia.  \n', '\n', 'This order provides 12-hour care.']
['Year: 2019\n', '\n', 'The appellant was determined to not suffer from Dementia.  \n', '\n', 'The order remains for partial care.']
['Year: 2016\n', '\n', 'The appellant was determined to suffer from Dementia.  \n',

In [37]:
## Build a list of dictionaries, one per file
cognitions_list = []

for text_file in all_txts:
    with open(text_file, "r") as my_text:
        all_text = my_text.readlines()

    # Line 0 holds the year, e.g. "Year: 2016"
    year = all_text[0].replace("Year: ", "").replace("\n", "")

    # Line 2 holds the medical finding
    finding = all_text[2]
    if "The appellant was determined to suffer from Dementia" in finding:
        medical_condition = "Dementia"
    elif "The appellant was determined to suffer from Alzheimer's disease" in finding:
        medical_condition = "Alzheimer"
    else:
        medical_condition = "Not Specified"

    # True only when a cognition-related condition was found
    cognition_related = medical_condition in ("Dementia", "Alzheimer")

    # Line 4 holds the care order
    care_order = all_text[4]
    if "24-hour care" in care_order:
        care_hours = "24-hour care"
    elif "12-hour care" in care_order:
        care_hours = "12-hour care"
    else:
        care_hours = "partial care"

    cognitions_dict = {"Year": year, "Medical Condition": medical_condition, "Condition related": cognition_related, "Care hours": care_hours}
    cognitions_list.append(cognitions_dict)
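
The assignment asks for columns named year, cognition_related, medical_condition and care_hours, with care reported as "half-day", "full-day" or "Not Specified", whereas the loop above keeps the raw phrases from the files. The following is a minimal sketch of one way the mapping could look for a single file. `classify_decision` is a hypothetical helper, and treating "partial care" as "Not Specified" is an assumption about how the assignment wants that case handled.

```python
def classify_decision(lines):
    """Turn the lines of one decision file into a row for the CSV.

    classify_decision is a hypothetical helper, not part of the notebook.
    """
    text = " ".join(line.strip() for line in lines)

    # The first line looks like "Year: 2016".
    year = lines[0].replace("Year:", "").strip()

    # Only positive findings match: "determined to not suffer from ..."
    # does not contain "determined to suffer from ...".
    if "determined to suffer from Dementia" in text:
        medical_condition = "Dementia"
    elif "determined to suffer from Alzheimer's" in text:
        medical_condition = "Alzheimer's"
    else:
        medical_condition = "Not Specified"

    cognition_related = medical_condition != "Not Specified"

    # Assumption: anything other than 12- or 24-hour care (for example
    # "partial care") is reported as "Not Specified".
    if "24-hour care" in text:
        care_hours = "full-day"
    elif "12-hour care" in text:
        care_hours = "half-day"
    else:
        care_hours = "Not Specified"

    return {
        "year": year,
        "cognition_related": cognition_related,
        "medical_condition": medical_condition,
        "care_hours": care_hours,
    }


# Sample lines copied from the printed file contents above.
sample = [
    "Year: 2021\n",
    "\n",
    "The appellant was determined to suffer from Alzheimer's disease.  \n",
    "\n",
    "This order provides 24-hour care.",
]
row = classify_decision(sample)
print(row)
```

Because the dictionary keys already match the required column names, a list of such rows could be passed to `pd.DataFrame` without any renaming step.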

In [38]:
## Check that the records were saved as a list of dictionaries
cognitions_list

[{'Year': '2016',
  'Medical Condition': 'Dementia',
  'Condition related': True,
  'Care hours': '24-hour care'},
 {'Year': '2017',
  'Medical Condition': 'Not Specified',
  'Condition related': False,
  'Care hours': 'partial care'},
 {'Year': '2021',
  'Medical Condition': 'Alzheimer',
  'Condition related': True,
  'Care hours': '24-hour care'},
 {'Year': '2021',
  'Medical Condition': 'Not Specified',
  'Condition related': False,
  'Care hours': 'partial care'},
 {'Year': '2018',
  'Medical Condition': 'Dementia',
  'Condition related': True,
  'Care hours': '12-hour care'},
 {'Year': '2021',
  'Medical Condition': 'Dementia',
  'Condition related': True,
  'Care hours': '12-hour care'},
 {'Year': '2019',
  'Medical Condition': 'Not Specified',
  'Condition related': False,
  'Care hours': 'partial care'},
 {'Year': '2016',
  'Medical Condition': 'Dementia',
  'Condition related': True,
  'Care hours': '24-hour care'},
 {'Year': '2014',
  'Medical Condition': 'Dementia',
  'Condi

In [39]:
## Open as a dataframe
df = pd.DataFrame(cognitions_list)
df

| | Year | Medical Condition | Condition related | Care hours |
|---|---|---|---|---|
| 0 | 2016 | Dementia | True | 24-hour care |
| 1 | 2017 | Not Specified | False | partial care |
| 2 | 2021 | Alzheimer | True | 24-hour care |
| 3 | 2021 | Not Specified | False | partial care |
| 4 | 2018 | Dementia | True | 12-hour care |
| 5 | 2021 | Dementia | True | 12-hour care |
| 6 | 2019 | Not Specified | False | partial care |
| 7 | 2016 | Dementia | True | 24-hour care |
| 8 | 2014 | Dementia | True | 24-hour care |
| 9 | 2016 | Dementia | True | 24-hour care |


In [42]:
## Export the dataframe to CSV without the index column
df.to_csv("cognitions.csv", index=False, encoding="UTF-8")
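
The call above writes `cognitions.csv` to the notebook's working directory. Since the assignment asks for the file to land in the downloads folder, one option is to build that path first. This is a sketch that assumes the folder is named `Downloads` and sits directly under the user's home directory, which is common but not guaranteed; `downloads_path` is a hypothetical helper.

```python
from pathlib import Path


def downloads_path(filename, home=None):
    """Build a path to filename inside the user's Downloads folder.

    Assumption: the folder is called "Downloads" and sits directly under
    the home directory. Pass home explicitly to override the default.
    """
    base = Path(home) if home is not None else Path.home()
    return base / "Downloads" / filename


target = downloads_path("cognitions.csv")
print(target)

# With the notebook's DataFrame, the export could then be written as:
# df.to_csv(target, index=False, encoding="UTF-8")
```

Building the path with `pathlib` avoids hard-coding a user name or an operating-system-specific separator.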