
#### Student Name: Rajath Akshay Vanikul

Date: 14/03/2019

Version: 2.0

Environment: Python 3.6.4 and Anaconda 5.7.6 (64-bit)

Libraries used:
* re 3.7.3 (for regular expression, included in Anaconda Python 3.6) 
* json 3.6.1 (encoder and decoder, included in Anaconda Python 3.6)


## 1. Introduction
In this assignment we extract data from a semi-structured text file. The document contains information about several units at Monash University. Each observation contains information about a unit, e.g., unit code, unit title, synopsis, requirements, outputs, chief examiner, etc. There are a total of 400 unit observations in a 2.27 MB file named `29498724.txt`. The task is to extract the data, wrangle it into the right format and transform it into `XML` and `JSON` with the following elements:
1. Unit code: is a 7-character string (three uppercase letters followed by 4 digits).
2. Pre-requisites: only the unit codes of the units that are pre-requisite + co-requisite for the current unit (‘NA’ if the value is Null).
3. Prohibitions: only the unit codes of the units that are prohibited to be taken with the current unit (‘NA’ if the value is Null).
4. Synopsis: a string containing the synopsis of the unit (‘NA’ if the value is Null)
5. Requirements: the list of requirements of the current unit (‘NA’ if the value is Null).
6. Outputs: the list of outputs of the current unit (‘NA’ if the value is Null).
7. Chief-examiners: the list of the chief-examiners of the current unit (‘TBA’ if Null).

The process is as follows:
1. Extract each unit's information and store it in a list, with each unit observation as an element.
2. Extract all the described categories for each unit and store the information in a dictionary.
3. Store all of the unit information in a list of dictionaries, e.g., `Units[{unit1},{unit2}....{unit400}]`.
4. Store the complete output as a key-value pair in a dictionary so that it can be transformed to `JSON` format.
5. Convert the final dictionary to `XML` by appending the respective tags.
6. Store the `JSON` result into a `29498724.json` file and the `XML` result into a `29498724.xml` file.

More details for each task will be given in the following sections.

## 2.  Import libraries

Import the regular expression and JSON libraries.

In [20]:
import re
import json

## 3. Loading data

We read the contents of the text file to a variable using the code below:

In [4]:
# open the document, read its contents into a variable, and close it automatically.
with open("29498724.txt", 'rt') as data_txt:
    data = data_txt.read()

## 4. Examining and extracting unit information

### 4.1 Examining valid unit codes

We first examine the unit codes of all the units in the text file and find that it contains information on 400 units.
The code to extract all the unit codes is as follows:

In [5]:
# using the regular expression to find all the matches and return a list to ID.
ID = re.findall(r'(?:<span class="unitcode">([A-Z]+[0-9]+)</span>)',data)

# determine the length of the list.
len(ID)

400

However, as per the assignment description, a valid unit code is a 7-character string (3 uppercase letters followed by 4 digits), so we check how many unit codes match this pattern.

In [6]:
# using the regular expression to find all the matches and return a list to ID1.
ID1 = re.findall(r'(?:<span class="unitcode">([A-Z]{3}[0-9]{4})</span>)',data)

# determine the length of the list.
len(ID1)

393

Next, we investigate the unit codes that do not match the specified pattern (3 uppercase letters followed by 4 digits).

In [7]:
# list comprehension to find out the elements in one list(ID) that is absent in the other list(ID1).
[x for x in ID if x not in ID1]

['DPSY7131',
 'DPSY6261',
 'DPSY6105',
 'DPSY6162',
 'DPSY5299',
 'DPSY6199',
 'DPSY5161']

As the above unit codes do not satisfy the specification given in the assignment, **we discard the information for these units**.
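
The same validity check can also be written as a whole-string match on each code. The following is only a minimal sketch: the sample list and the helper name `is_valid_unit_code` are assumptions made for illustration, and in the notebook the input would be the `ID` list.

```python
import re

# Hypothetical sample; in the notebook the codes come from the `ID` list.
sample_codes = ["ATS2839", "DPSY7131", "ATS2809", "DPSY6261"]

# Three uppercase letters followed by four digits.
UNIT_CODE = re.compile(r"[A-Z]{3}[0-9]{4}")

def is_valid_unit_code(code):
    """Return True only if the whole string matches the 7-character pattern."""
    return UNIT_CODE.fullmatch(code) is not None

valid = [c for c in sample_codes if is_valid_unit_code(c)]
invalid = [c for c in sample_codes if not is_valid_unit_code(c)]

print(valid)
print(invalid)
```

`re.fullmatch` is used so that a longer code such as `DPSY7131` cannot pass by matching only part of the string.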

### 4.2 Extracting the unit titles

Similarly, we find all the unit titles using the code below:

In [8]:
# re to find all the unit title.
title = re.findall(r'(?:>[A-Z]{3}[0-9]{4}</span> - (.*?)<span)',data)

# determine the length of the list of valid titles.
len(title)

393

### 4.3 Extracting unit synopsis
Next, we extract other unit-specific information, starting with the synopsis.
After identifying the tag pattern, we extract the whole chunk of synopsis data.

However, on examining the complete result, I found a few unwanted tags within the text that need to be eliminated. I used `re.sub('<.*?>','', synopsis)` to replace every tag with an empty string and clean the extracted data.

The code is as below:

In [9]:
# re to search for the first synopsis chunk in the data.
synopsis = re.search(r'(?:Synopsis</h2>\n<div>\n<p>(.*)</p>)',data)[1]

# eliminate the unwanted tags in the captured text.
synopsis = re.sub('<.*?>','', synopsis)

# print the final synopsis.
synopsis

"This unit examines major issues in ethical theory. One aspect of this is an inquiry into central questions in the philosophical sub-discipline known as 'metaethics'. Some of these questions include: are there objective moral facts? Is moral judgment grounded primarily in reasoning, or emotion, or something else? And what motivates people to do what they believe is right? The unit also involves an exploration of the strengths and weaknesses of consequentialist ethical theories, like Utilitarianism, which assess the morality of people's actions solely in terms of the consequences of those actions. The unit examines debates between these theories and rival theories that incorporate other elements, such as duties, rights, contractualist principles, and people's character and virtues."
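
To illustrate why the lazy `<.*?>` is used for cleaning, and why the capture stops at the end of the first paragraph line, here is a small sketch on a made-up fragment. The HTML string is an assumption that only mimics the structure described above.

```python
import re

# Hypothetical fragment shaped like the synopsis section of a unit.
html = ('Synopsis</h2>\n<div>\n'
        '<p>Covers <a href="x.html">ATS1000</a> topics.</p>\n'
        '<p>Second paragraph.</p>\n</div>')

# Without the (?s) flag, '.' does not cross newlines,
# so the capture stays on the first paragraph line.
first_para = re.search(r'Synopsis</h2>\n<div>\n<p>(.*)</p>', html)[1]

# Lazy matching removes each tag on its own.
clean = re.sub(r'<.*?>', '', first_para)

# Greedy matching runs from the first '<' to the last '>' and
# also deletes the text between the tags.
greedy = re.sub(r'<.*>', '', first_para)

print(repr(first_para))
print(repr(clean))
print(repr(greedy))
```

One consequence of matching without `(?s)` is that a synopsis spread over several paragraph lines would only have its first line captured.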

### 4.4 Extracting pre-requisites, co-requisites and prohibitions

As per the assignment specification, we need to store pre-requisites and co-requisites together in a single field called pre-requisites.

I noticed that I could extract the whole chunk of pre-requisite, co-requisite and prohibition information using a single regular expression. I followed this order to extract the final information:
1. Extract the chunk of data containing pre-requisite, co-requisite and prohibition information from the unit data.
2. Extract the pre-requisite/co-requisite text from the chunk obtained in step 1.
3. Extract all occurrences of unit codes from that text into a list named `pre_req_each`.
4. Extract the prohibition text from the chunk obtained in step 1.
5. Extract all occurrences of unit codes from that text into a list named `prohibision_each`.

In [10]:
# re to find all the pre-requisite, co-requisite and prohibition chunks in the data.
pre_req = re.findall(r'(?s)(?:Prerequisites</p>\n<p>(.*?)</div>\n</div>)|$',data)

# print the fourth element of the list.
pre_req[3]

'<span class="unitlink"><a href="/pubs/2019handbooks/units/ATS2122.html">ATS2122</a></span> or ATS2808</p>\n<p>This unit is only available to students enrolled in a Bachelor of Music single or double degree - Music performance specialisation.</p>\n<p class="hbk-preamble-heading">Prohibitions</p>\n<p>ATS2809</p>\n'

In [15]:
# extracting all the unit codes under pre-requisite and co-requisite chunk in above data element.
# everything before 'Prohibitions' is kept, or the whole chunk when that heading is absent.
pre_req_all = re.search(r'(?s)(.*?)(?:Prohibitions|$)',pre_req[3])[1]
pre_req_each = list(set(re.findall(r'([A-Z]+[0-9]+)',pre_req_all)))

# extracting all the unit codes under the prohibition chunk in above data element.
prohibision = str(re.search(r'(?s)(?:Prohibitions(.*))|$',pre_req[3])[1])
prohibision_each = list(set(re.findall(r'([A-Z]+[0-9]+)',prohibision)))

# printing the lists of pre-requisites/co-requisites and prohibitions in the data element.
print("pre-req:",pre_req_each)
print("prohibision:",prohibision_each)


pre-req: ['ATS2808', 'ATS2122']
prohibision: ['ATS2809']
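
Several patterns in this notebook end with `|$` so that `re.search` always returns a match object, even when the real pattern is absent. The sketch below shows what that fallback returns and one possible way to keep the pre-requisite text when there is no `Prohibitions` heading. Both text chunks are hypothetical, and the alternative pattern is an assumption rather than a required approach.

```python
import re

# Hypothetical chunks: one with a Prohibitions heading, one without.
with_prohib = 'ATS2122 or ATS2808</p>\n<p class="h">Prohibitions</p>\n<p>ATS2809</p>\n'
without_prohib = 'ATS2122 or ATS2808</p>\n'

pattern = r'(?s)(?:(.*?)Prohibitions)|$'

# When the heading exists, group 1 holds the text before it.
found = re.search(pattern, with_prohib)[1]

# When it is missing, only the empty '$' branch matches and group 1 is None.
fallback = re.search(pattern, without_prohib)[1]
print(fallback)              # None
print(repr(str(fallback)))   # 'None', a four-character string, not ''

# Possible alternative: stop at the heading OR at the end of the text,
# so the pre-requisite text survives when there are no prohibitions.
before = re.search(r'(?s)(.*?)(?:Prohibitions|$)', without_prohib)[1]
codes = re.findall(r'[A-Z]{3}[0-9]{4}', before)
print(codes)
```

Because `str(None)` is the string `'None'`, a later comparison against `""` will not detect a missing section, so it is safer to test the group for `None` before converting it.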


### 4.5 Extracting Unit Requirements

On examining the text data, I noticed that the unit requirements appear under the `Assessment` heading.
1. Extract all the text under that heading, as shown below.
2. Extract each individual requirement from the text obtained in step 1.
3. Store the final bullet points as elements in a list (`req_each`).

In [10]:
# re to find all the unit requirements in the data.
req = re.findall(r'(?s)(?:Assessment</h2>.*?<div>(.*?)</div>)',data)

#printing the 22nd element.
req[21]

'\n<p>Within semester assessment: 75%</p>\n<p>Exam: 25%</p>\n'

In [11]:
# extracting all the unit requirements in the above data element.
req_each = re.findall(r'(?s)(?:<p>(.*?)</p>)',req[21])

# printing the list of requirements.
req_each

['Within semester assessment: 75%', 'Exam: 25%']

### 4.6 Extracting Unit Outcomes

Extract all the data under the `Outcomes` heading to obtain the outcome points.
1. Extract all the text under that heading, as shown below.
2. Extract only the individual outcomes between `<li>` and `</li>` from the text obtained in step 1.
3. Store the final outcomes as elements in a list (`outcome_each`).

In [12]:
# re to find all the unit outcomes in the data.
outcomes = re.findall(r'(?s)(?:Outcomes</h2>\n<div>(.*?)</div>)',data)

#printing the first element.
outcomes[0]

'\n<p>Students successfully completing the unit will be able to:</p>\n<ol princestart="0" start="1" type="1">\n<li>explain key concepts, arguments, and principles in ethical theory;</li>\n<li>summarise and interpret contemporary and historical texts in the field of ethics;</li>\n<li>analyse and evaluate detailed philosophical arguments in contemporary and historical texts in the field of ethics;</li>\n<li>put into practice bibliographical skills that are relevant to the discipline of philosophy, including referencing and citation techniques;</li>\n<li>compose original research essays, using the argumentative conventions of philosophical essay-writing, and demonstrating independent critical judgment.</li>\n</ol>\n'

In [13]:
# extracting all the unit outcomes in above data element.
outcome_each = re.findall(r'(?s)(?:<li>(.*?(?!<p>))</li>)',outcomes[0])

# printing the list of outcomes.
outcome_each

['explain key concepts, arguments, and principles in ethical theory;',
 'summarise and interpret contemporary and historical texts in the field of ethics;',
 'analyse and evaluate detailed philosophical arguments in contemporary and historical texts in the field of ethics;',
 'put into practice bibliographical skills that are relevant to the discipline of philosophy, including referencing and citation techniques;',
 'compose original research essays, using the argumentative conventions of philosophical essay-writing, and demonstrating independent critical judgment.']
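
A note on the pattern used above: the negative lookahead `(?!<p>)` sits immediately before `</li>`, so whenever `</li>` matches at that position the lookahead is satisfied as well. The sketch below, on a made-up list, compares it with the plain lazy pattern and also strips inline tags from each item. The sample chunk is an assumption.

```python
import re

# Hypothetical outcome list in the same shape as the chunk shown above.
chunk = ('<ol>\n<li>explain key concepts;</li>\n'
         '<li>apply <em>critical</em> judgment.</li>\n</ol>')

with_lookahead = re.findall(r'(?s)<li>(.*?(?!<p>))</li>', chunk)
plain = re.findall(r'(?s)<li>(.*?)</li>', chunk)

# Remove any inline tags left inside an item.
cleaned = [re.sub(r'<.*?>', '', item) for item in plain]

print(with_lookahead == plain)
print(cleaned)
```

On this sample both patterns return the same list, which suggests the simpler `<li>(.*?)</li>` would be enough.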

### 4.7 Extracting Unit's Chief Examiners

I identified the chief examiners in the text data under the `Chief examiner(s)` heading.
1. Extract all the text under that heading, as shown below.
2. Extract only the individual examiner names from the text obtained in step 1.
3. Store the final examiner names as elements in a list (`ce_each`).

In [14]:
# re to find all the unit's chief examiners in the data.
chief_examiner = re.findall(r'(?s)(?:Chief examiner\(s\)</p>(.*?)\n</p>)',data)

# print the first element. 
chief_examiner[0]

'\n<p>\n<a href="http://staffsearch.monash.edu/?name=David Ripley">Dr David Ripley</a>\n<br/>'

In [15]:
# extracting all the unit's chief examiners in above data element.
ce_each = re.findall(r'(?:name=[a-zA-Z ]*">(.*?)</a>)',chief_examiner[0])

# printing the list of chief examiners.
ce_each

['Dr David Ripley']
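
The character class `[a-zA-Z ]` in the pattern above only allows letters and spaces in the `name=` part of the link. As a hedged sketch, the example below uses two made-up examiner names to show how a name containing an apostrophe would be skipped, and how a more permissive class `[^"]` would pick it up. Whether such names occur in the actual data is not something I have verified.

```python
import re

# Hypothetical examiner block; both names are invented for illustration.
block = ('\n<p>\n<a href="http://staffsearch.monash.edu/?name=Jane O\'Neil">'
         'Dr Jane O\'Neil</a>\n<br/>'
         '\n<a href="http://staffsearch.monash.edu/?name=John Smith">'
         'Dr John Smith</a>\n<br/>')

# Letters and spaces only: the apostrophe stops the match for the first link.
strict = re.findall(r'name=[a-zA-Z ]*">(.*?)</a>', block)

# Anything up to the closing quote of the href attribute.
permissive = re.findall(r'name=[^"]*">(.*?)</a>', block)

print(strict)
print(permissive)
```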

## 5. Extracting each unit observation.

This is the first step in organising and distinguishing between the different units in the text document. I extracted each unit's information using a regular expression, which yielded the information for all 400 units.

As we only need the units whose unit code has 7 characters, we filter out the rest in the actual program in the following step.

In [16]:
# extracting each unit observation as a chunk, to be filtered later against the unit code specification.
units = re.findall(r'(?s)(?<=<h1 class)(.*?)(?=<!-- /.content_container--> </div>)',data)

len(units)

400

## 6. Complete program to obtain the list of dictionaries with unit information

Now that we understand how to extract the unit information from the given text, we can combine the steps:
1. Extract all the listed attributes (unit ID, title, synopsis, pre-requisites, prohibitions, requirements, outcomes, chief examiners) of a unit and store them as key-value pairs in a dictionary. Where required, the value for a key is a list. Sample output is as below:

`{
"@id": "ATS2839", 
"title": "Ethics", 
"synopsis": "This unit examines major issues in ethical theory. One aspect of this is an inquiry into central questions in the philosophical sub-discipline known as 'metaethics'. Some of these questions include: are there objective moral facts? Is moral judgment grounded primarily in reasoning, or emotion, or something else? And what motivates people to do what they believe is right? The unit also involves an exploration of the strengths and weaknesses of consequentialist ethical theories, like Utilitarianism, which assess the morality of people's actions solely in terms of the consequences of those actions. The unit examines debates between these theories and rival theories that incorporate other elements, such as duties, rights, contractualist principles, and people's character and virtues.", 
"pre_requistics": "NA", 
"prohibisions": {"prohibision": ["AZA2939", "AZA3939", "ATS1839"]}, 
"requirements": {"requirement": ["Within semester assessment: 60% + Exam: 40%"]}, 
"outcomes": {"outcome": ["explain key concepts, arguments, and principles in ethical theory;", "summarise and interpret contemporary and historical texts in the field of ethics;", "analyse and evaluate detailed philosophical arguments in contemporary and historical texts in the field of ethics;", "put into practice bibliographical skills that are relevant to the discipline of philosophy, including referencing and citation techniques;", "compose original research essays, using the argumentative conventions of philosophical essay-writing, and demonstrating independent critical judgment."]}, 
"chief_examiners": {"chief_examiner": ["Dr David Ripley"]}
}`


2. Append these dictionaries to a list of units, which should have the following structure:

`unit[{unit1},{unit2}.......{unit393}]`

The program below implements all of the extraction methods described above to arrive at the format required for the task.

In [17]:
# initiate a list of unit.
unit = []

# iterate through each chunk of unit information that is extracted.
for i in units:
    # initiate a dictionary to store specified information about a unit. 
    unit_dict ={}
    
    # check for the units with the given unit pattern of 7 characters.
    if re.search(r'(?:<span class="unitcode">([A-Z]{3}[0-9]{4})</span>)|$',i)[1] is not None:
        
        # with '@id' as a key, store the extracted unit code as a value.
        unit_dict["@id"] = re.search(r'(?:<span class="unitcode">([A-Z]{3}[0-9]{4})</span>)',i)[1]
        
        # with 'title' as a key, store the extracted unit title as a value.
        unit_dict["title"] = re.search(r'(?:[A-Z]+[0-9]+</span> - (.*?)<span)',i)[1]
        
        # re to search for the synopsis chunk in the data.
        synopsis = str(re.search(r'(?:Synopsis</h2>\n<div>\n<p>(.*)</p>)|$',i)[1])
        
        # re to search for the pre-requistics chunk in the data.
        pre_req = str(re.search(r'(?s)(?:Prerequisites</p>\n<p>(.*?)</div>\n</div>)|$',i)[1])
        
        # re to search for the requirement chunk in the data.
        req = str(re.search(r'(?s)(?:Assessment</h2>.*?<div>(.*?)</div>)|$',i)[1])
        
        # re to search for the outcomes chunk in the data.
        outcomes = str(re.search(r'(?s)(?:Outcomes</h2>\n<div>(.*?)</div>)|$',i)[1])
        
        # re to search for the chief examiner chunk in the data.
        chief_examiner = str(re.search(r'(?s)(?:Chief examiner\(s\)</p>(.*?)\n</p>)|$',i)[1])
        
        # check for the search result of synopsis; str(None) gives "None" when nothing matched.
        if synopsis not in ("", "None"):
            # eliminate the unwanted tags in the captured text.
            synopsis = re.sub('<.*?>','', synopsis)
            # with 'synopsis' as a key, store the extracted unit synopsis as a value.
            unit_dict["synopsis"] = synopsis
        else:
            # with 'synopsis' as a key, store 'NA' as a value.
            unit_dict["synopsis"] = "NA"
        
        # check for the search result of pre-requisite,co-requisite and prohibisions.
        if pre_req != "":
            
            # extracting pre-requisite,co-requisite text chunk: everything before 'Prohibitions',
            # or the whole chunk when there is no 'Prohibitions' heading.
            pre_req_all = re.search(r'(?s)(.*?)(?:Prohibitions|$)',pre_req)[1]
            pre_req_each = list(set(re.findall(r'([A-Z]+[0-9]+)',pre_req_all))) # extracting all the unit codes

            # extracting prohibision text chunk.
            prohibision = str(re.search(r'(?s)(?:Prohibitions(.*))|$',pre_req)[1])
            prohibision_each = list(set(re.findall(r'([A-Z]+[0-9]+)',prohibision))) # extracting all the unit codes
            
            # check for the search result of pre-requisite,co-requisite.
            if pre_req_each:
                # with 'pre_requistics' as a key, store the value as a dictionary.
                unit_dict["pre_requistics"] = {"pre_requistic":pre_req_each}
            else:
                unit_dict["pre_requistics"] = "NA"
            
            # check for the search result of prohibisions.
            if prohibision_each:
                # with 'prohibisions' as a key, store the value as a dictionary.
                unit_dict["prohibisions"] = {"prohibision":prohibision_each}
            else:
                unit_dict["prohibisions"] = "NA"
        else:
            unit_dict["pre_requistics"] = "NA"
            unit_dict["prohibisions"] = "NA"


        # check for the search result of requirements.
        if req != "":
            # extracting requirements text chunk.
            req_each = re.findall(r'(?s)(?:<p>(.*?)</p>)',req)
            # check for the search result of requirement list.
            if req_each:
                # with 'requirements' as a key, store the value as a dictionary.
                unit_dict["requirements"] = {"requirement":list(req_each)}
            else:
                unit_dict["requirements"] = "NA"
        else:
            unit_dict["requirements"] = "NA"

        # check for the search result of outcome.
        if outcomes != "":
            # extracting outcome text chunk.
            outcome_each = re.findall(r'(?s)(?:<li>(.*?(?!<p>))</li>)',outcomes)
            # eliminate the unwanted tags in the captured text.
            outcome_each = [(re.sub('<.*?>','',i)) for i in outcome_each]
            # check for the search result of outcome.
            if outcome_each:
                # with 'outcomes' as a key, store the value as a dictionary.
                unit_dict["outcomes"] = {"outcome":outcome_each}
            else:
                unit_dict["outcomes"] = "NA"
        else:
            unit_dict["outcomes"] = "NA"

        # check for the search result of chief examiner.
        if chief_examiner !="":
            # extracting the chief examiners list.
            ce_each = re.findall(r'(?:name=[a-zA-Z ]*">(.*?)</a>)',chief_examiner)
            # check for the search result of chief examiner list.
            if ce_each:
                # with 'chief examiner' as a key, store the value as a dictionary.
                unit_dict["chief_examiners"] = {"chief_examiner":ce_each}
            else:
                unit_dict["chief_examiners"] = "TBA"
        else:
            unit_dict["chief_examiners"] = "TBA"
        
        # append the dictionary of unit information to a list of units.
        unit.append(unit_dict)

Wrap the list of units in a dictionary to arrive at the format specified for the task.

The code is as below:

In [18]:
output = {"units":{"unit":unit}}
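
The program above repeats the same "wrap the list or fall back to a default" logic for every field. As a sketch of how that could be factored out, the helper below (the name `wrap_or_default` and the sample values are assumptions, not part of the original program) returns the nested dictionary for a non-empty list and the marker string otherwise.

```python
def wrap_or_default(values, tag, default="NA"):
    """Return {tag: [values...]} for a non-empty list, otherwise the default marker."""
    return {tag: list(values)} if values else default

# Hypothetical extracted lists for a single unit.
unit_dict = {
    "@id": "ATS2839",
    "pre_requistics": wrap_or_default([], "pre_requistic"),
    "prohibisions": wrap_or_default(["AZA2939", "ATS1839"], "prohibision"),
    "chief_examiners": wrap_or_default([], "chief_examiner", default="TBA"),
}

print(unit_dict)
```

The key names follow the ones used in the program so that the resulting structure stays the same.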

## 7. Transforming the output dictionary to JSON format

Using the `json` library, we can dump the output dictionary to obtain a JSON file.
1. Create the output file (`29498724.json`) with write permission.
2. Use the `dump()` function from the `json` library to write the output dictionary to the file as JSON.

In [19]:
# create a JSON file and dump the dictionary into the file in JSON format using the JSON package.
with open('29498724.json', 'w') as out:
    json.dump(output, out)
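
A quick way to check the result is to load the JSON back and compare it with the dictionary. The sketch below does this in memory with a small made-up `output` dictionary, so it does not touch the real file. The `indent` argument is optional and only makes the text easier to read.

```python
import io
import json

# Hypothetical miniature version of the output dictionary.
output = {"units": {"unit": [
    {"@id": "ATS2839", "title": "Ethics", "pre_requistics": "NA"}
]}}

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
json.dump(output, buffer, indent=2)

# Parse the text again and confirm nothing was lost in the round trip.
reloaded = json.loads(buffer.getvalue())
print(reloaded == output)
```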

## 8. Transforming the output dictionary to XML format

We iterate through the dictionary and append the appropriate tags to obtain the XML format.
1. Iterate through the first layer of the values in the output dictionary and append the opening `<key>` tag for the key to the XML output variable.
2. Iterate through the list of units, denoted by `d`.
3. Check the data type of each value in the key-value pairs of every unit dictionary in `d`.
4. If the value is not a dictionary, append the tag `<key>`, followed by the value and `</key>`, to the XML output variable. The unit code (the `@id` key) is instead written as the opening `<unit id="...">` tag.
5. If the value is a dictionary, iterate through it and wrap each list element in its own tag, as in step 4.
6. Append `</unit>` after every unit's information, as specified in the task.
7. Terminate the list with the closing tag of the outermost dictionary's key.
8. Use the `"".join()` function to concatenate all the elements in the XML output variable.
9. Remove any `\n` characters present in the result. Joining with an empty separator does not add any, so these can only come from the extracted text.
10. Create the output file (`29498724.xml`) with write permission and write the XML output variable to the file.

In [20]:
# initiate a xml output list to append tags and values
xml_output = []
# iterate through the output dictionary with one pair "units".
for a,b in output.items():
    # append the key tag
    xml_output.append('<'+a+'>')
    # iterate through the dictionary with one variable "unit".
    for c,d in b.items():
        # iterate through the list of units.
        for e in d:
            # iterate through the unit information of each unit.
            for k,l  in e.items():
                # check the type of value.
                check=type(l)
                
                # check for dictionary
                if check!=dict:
                    # append unit id as suggested in the task.
                    if '@' in k:
                        xml_output.append('<'+'unit id='+'"'+l+'"'+'>')
                    # append the rest of the tag values to the list.
                    else:
                        xml_output.append('<'+k+'> '+l+' </'+k+'>') 
                else:
                    # iterate through the dictionary if the value is a dict.
                    for m,n in l.items():
                        # append the key tag.
                        xml_output.append('<'+k+'>')
                        # append the list of values with key tags
                        for o in n:                            
                            xml_output.append('<'+m+'>'+o+'</'+m+'>')
                        xml_output.append('</'+k+'>') #closing tag
            xml_output.append('</'+'unit'+'>') # closing unit tag after each unit as suggested in the task.
    xml_output.append('</'+a+'>') # closing tag for the outermost dictionary key.

In [21]:
xml_output=''.join(xml_output) # converts list into string
xml_output=xml_output.replace('\n','') # remove any '\n' characters carried over from the extracted text

In [22]:
# create a XML file and write the output into the file in XML format.
xml_file = open('29498724.xml', 'w')
xml_file.write(str(xml_output))

677903
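
One thing to be aware of is that building XML by string concatenation does not escape characters such as `&` or `<` inside the values. As an alternative sketch, assuming the same dictionary layout, the standard library's `xml.etree.ElementTree` can build the same tag structure and escape the text automatically. The title in the sample is made up to show the escaping, and unlike the code above this version does not pad values with spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature output dictionary; the title is invented.
output = {"units": {"unit": [{
    "@id": "ATS2839",
    "title": "Law & Ethics",
    "pre_requistics": "NA",
    "prohibisions": {"prohibision": ["AZA2939", "ATS1839"]},
}]}}

root = ET.Element("units")
for unit in output["units"]["unit"]:
    # The unit code becomes the id attribute of the <unit> tag.
    unit_el = ET.SubElement(root, "unit", id=unit["@id"])
    for key, value in unit.items():
        if key == "@id":
            continue
        child = ET.SubElement(unit_el, key)
        if isinstance(value, dict):
            # One nested tag per list element.
            for tag, items in value.items():
                for item in items:
                    ET.SubElement(child, tag).text = item
        else:
            child.text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Parsing `xml_text` back with `ET.fromstring` is a simple way to confirm that the result is well-formed.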

## 9. Summary

I have identified all 7 of the parameters listed below for each valid unit in the text data.
The final result satisfies all of the following conditions:
1. Unit code: is a 7-character string (three uppercase letters followed by 4 digits).
2. Pre-requisites: only the unit codes of the units that are pre-requisite + co-requisite for the current unit (‘NA’ if the value is Null).
3. Prohibitions: only the unit codes of the units that are prohibited to be taken with the current unit (‘NA’ if the value is Null).
4. Synopsis: a string containing the synopsis of the unit (‘NA’ if the value is Null)
5. Requirements: the list of requirements of the current unit (‘NA’ if the value is Null).
6. Outputs: the list of outputs of the current unit (‘NA’ if the value is Null).
7. Chief-examiners: the list of the chief-examiners of the current unit (‘TBA’ if Null).

I performed the following steps to complete this task:
1. Identified the regular expressions to extract the required text from the larger data.
2. Using the larger text data for each unit, extracted the values for the unit parameters and assigned "NA" or "TBA" where specified.
3. Organised the extracted data into the required combination of dictionaries and lists.
4. Transformed the output into JSON and XML format.

## 10. References

* https://stackoverflow.com/questions/24867342/regex-get-string-between-two-strings-that-has-line-breaks

* https://stackoverflow.com/questions/38579725/return-string-with-first-match-regex/38579881
    
* https://stackoverflow.com/questions/8303488/regex-to-match-any-character-including-new-lines
    
* https://stackoverflow.com/questions/8703017/remove-sub-string-by-using-python
