### *text_mining* is a Python package developed as part of this repository

In [2]:
import pandas as pd
from datetime import datetime 
import re 
from text_mining import textMining

In [14]:
# Read the text file; each element keeps its trailing newline
with open('data/raw_text_with_dates.txt') as file:
    doc = file.readlines()

data = pd.Series(doc)
data.head(10)

0         03/25/93 Total time of visit (in minutes):\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (N...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9             (4/10/71)Score-1Audit C Score Current:\n
dtype: object

### Use case 1: Data Exploration
- Calculate the average length of the text entries in a pandas DataFrame or Series

In [15]:
tm = textMining(data)
avg_length_per_doc = tm.find_avg_length_of_text()

Average length of text per doc: 135.876
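
As a rough illustration of what such a statistic involves, here is a minimal sketch in plain Python. It assumes "length" means the number of characters per document; the package may count differently (for example, words), and `average_text_length` is a hypothetical helper, not part of `text_mining`.

```python
from statistics import mean

def average_text_length(texts):
    """Return the mean number of characters per document (hypothetical helper)."""
    # Cast to str so that non-string entries (e.g. numbers) do not raise.
    return mean([len(str(t)) for t in texts])

if __name__ == "__main__":
    docs = ["03/25/93 Total time of visit", "6/18/85 Primary Care Doctor:"]
    print(f"Average length of text per doc: {average_text_length(docs)}")
```

The function accepts any iterable of documents, so a pandas Series can be passed directly.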


### Use case 2: Extract Dates
- Find and extract a date from raw text when the date can be written in any of the formats below:
    - 04/20/2009; 04/20/09; 4/20/09; 4/3/09
    - Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
    - 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
    - Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
    - Feb 2009; Sep 2009; Oct 2010
    - 6/2008; 12/2009 
    - 2009; 2010

- Assign the extracted date to a new column and convert it to a date type

- Data type for demonstration: pandas DataFrame
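
To make the matching idea concrete, here is a minimal sketch that covers only the numeric formats from the list above (`4/20/09`, `04/20/2009`, `6/2008`, `2009`). It is not the package's implementation. The name `find_numeric_date` is hypothetical, and three assumptions are made: two-digit years belong to the 1900s (consistent with `93` becoming `1993` in the sample data), a missing day defaults to 1, and a missing month defaults to January.

```python
import re
from datetime import date
from typing import Optional

# Patterns are tried from most to least specific so the fullest match wins.
FULL = re.compile(r"(?<!\d)(\d{1,2})[/-](\d{1,2})[/-](\d{4}|\d{2})\b")
MONTH_YEAR = re.compile(r"\b(\d{1,2})/(\d{4})\b")
YEAR = re.compile(r"\b((?:19|20)\d{2})\b")

def _safe_date(year: int, month: int, day: int) -> Optional[date]:
    """Build a date, returning None for impossible values such as month 13."""
    try:
        return date(year, month, day)
    except ValueError:
        return None

def find_numeric_date(text: str) -> Optional[date]:
    """Return the first numeric date found in `text`, or None (hypothetical helper)."""
    m = FULL.search(text)
    if m:
        month, day, year = (int(g) for g in m.groups())
        if year < 100:
            year += 1900  # assumption: two-digit years are 19xx
        return _safe_date(year, month, day)
    m = MONTH_YEAR.search(text)
    if m:
        return _safe_date(int(m.group(2)), int(m.group(1)), 1)  # assumption: day = 1
    m = YEAR.search(text)
    if m:
        return date(int(m.group(1)), 1, 1)  # assumption: January 1
    return None

if __name__ == "__main__":
    for line in ["03/25/93 Total time of visit", "seen 6/2008", "since 2009"]:
        print(line, "->", find_numeric_date(line))
```

The month-name formats (`Mar 20, 2009`, `20 March 2009`, and so on) would need additional patterns following the same most-specific-first ordering.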

In [16]:
result_df = tm.extract_date()
result_df.head()

| | original_text | date |
|---|---|---|
| 0 | 03/25/93 Total time of visit (in minutes):\n | 1993-03-25 |
| 1 | 6/18/85 Primary Care Doctor:\n | 1985-06-18 |
| 2 | sshe plans to move as of 7/8/71 In-Home Servic... | 1971-07-08 |
| 3 | 7 on 9/27/75 Audit C Score Current:\n | 1975-09-27 |
| 4 | 2/6/96 sleep studyPain Treatment Pain Level (N... | 1996-02-06 |


### Use case 3: Extract social handles

In [17]:
# Create dummy data 
dummy = {'Author': ['John', 'Mark', 'Jenny', 'Sammi', 'Chris'],
         'Content': ['hi @Mark I am here', '@sample save 20% on http://www.mark.com', '@hello, visit https://www.jenny.com', 'Remind me @john @doe and @sarah', 'call me at 123-456-789']}

dummy_df = pd.DataFrame(dummy)
dummy_df

| | Author | Content |
|---|---|---|
| 0 | John | hi @Mark I am here |
| 1 | Mark | @sample save 20% on http://www.mark.com |
| 2 | Jenny | @hello, visit https://www.jenny.com |
| 3 | Sammi | Remind me @john @doe and @sarah |
| 4 | Chris | call me at 123-456-789 |


In [18]:
extract_handles = textMining(dummy_df)
extract_handles.extract_social_handle('Content')

| | Author | Content | extracted_handle |
|---|---|---|---|
| 0 | John | hi @Mark I am here | @Mark |
| 1 | Mark | @sample save 20% on http://www.mark.com | @sample |
| 2 | Jenny | @hello, visit https://www.jenny.com | @hello |
| 3 | Sammi | Remind me @john @doe and @sarah | @john |
| 4 | Sammi | Remind me @john @doe and @sarah | @doe |
| 5 | Sammi | Remind me @john @doe and @sarah | @sarah |
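
The output has one row per handle, and rows without any handle are dropped. A minimal sketch of that behaviour in plain Python follows. It is an illustration only: `explode_handles` and the `HANDLE` pattern are hypothetical and may differ from what `extract_social_handle` does internally.

```python
import re

# Assumption: a handle is '@' followed by word characters, and it must not be
# preceded by a word character (so 'user@example.com' is not treated as a handle).
HANDLE = re.compile(r"(?<!\w)@\w+")

def explode_handles(rows):
    """Return one (author, content, handle) tuple per handle found (hypothetical helper)."""
    out = []
    for author, content in rows:
        for handle in HANDLE.findall(content):
            out.append((author, content, handle))
    return out

if __name__ == "__main__":
    rows = [
        ("Sammi", "Remind me @john @doe and @sarah"),
        ("Chris", "call me at 123-456-789"),
    ]
    for row in explode_handles(rows):
        print(row)
```

With pandas, the same shape can be obtained by collecting the matches into a list column and expanding it with `DataFrame.explode`.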


### Use case 4: Remove URLs

In [4]:
# Sample text containing URLs
sample_text = """
Check out these websites:
- http://www.example.com
- https://example.org
- www.sample-site.com
- sample-site.net
- sample.site78.com
"""

# Instantiate the class
tm = textMining(sample_text)

# Remove URLs from the sample text
cleaned_text = tm.remove_urls(sample_text)

# Print the cleaned text
print(cleaned_text)


Check out these websites:
- 
- 
- 
- 
- 
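
For readers curious how such a cleanup can work, here is a minimal regex-based sketch. It is not the package's `remove_urls` implementation: `strip_urls` is a hypothetical name, and the bare-domain branch assumes a short, hard-coded list of top-level domains (`com`, `org`, `net`) that would need extending for real data.

```python
import re

URL = re.compile(
    r"""
    \b(?:https?://|www\.)\S+                            # scheme- or www-prefixed URL
    |
    \b[\w-]+(?:\.[\w-]+)*\.(?:com|org|net)\b(?:/\S*)?   # bare domain with an assumed TLD
    """,
    re.VERBOSE | re.IGNORECASE,
)

def strip_urls(text: str) -> str:
    """Delete every URL-like token from `text` (hypothetical helper)."""
    return URL.sub("", text)

if __name__ == "__main__":
    sample = "- http://www.example.com\n- sample-site.net\n- sample.site78.com\n"
    print(strip_urls(sample))
```

Only the URL itself is removed, so list markers and surrounding whitespace stay in place, matching the output shown above.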

