# HCI 574 General HW Instructions


### Work through the problems
- Answer the questions shown in the HW. Fix anything with ???. Answers to text questions should be given in a printed string, e.g.:  `print("This is the answer")` or as a comment e.g. `# This is the answer`
- Ensure that VSCode is set to autosve your notebook! In _Settings_ search for autosave and set it to 1000 ms.
- It's fine to create new python cells if you want to try something without changing the offcial cell but please ensure that at the end there's only one cell with your official answer to the question and (__very impotant!__) that that cell as been run/executed so we can see its output. If you want to "save" your inofficial cell(s), make sure they are fully commented out!



### Handing in the HW
- Check that all other files the HW might have (screenshots, data files, etc.) are indeed in the correct HW folder.
- Zip your HW folder folder (e.g. into HW1_ALemming.zip). On Windows you can Right-click -> Send To - Compress Folder. Please don't use rar or any other exotic compressors!
- Zip your HW folder and hand it into Gradescope


### Points

The number of points a problem is worth when solved properly is always shown inside brackets at the heading of the problem, e.g.
##### Q1 [ 3.5 pts]  This problem is worth 3.5 points

Sometimes you may get additional extra credits if we feel your solution is particularly clever, etc. 
Some problem are entirely __optional__, these will have a + in from of the points, e.g.
##### Q2 [ +1 pt ] This *optional* problem is worth 1 point

You can solve these optional points to learn more or make up of points you missed in earlier HW. Note however, that there's a cap on the total HW points, if you have more than 100% of HW points the end of the semester, it will be reduced to 100%

### Questions?
If you have questions or need help, ask me after class, use Piazza or ask the TA during office hours  

# HCI 574 - HW 6   


- For this HW you will write code to find files with the same name in two sub-folders inside the current (HW6) folder
- Example with `HW6_netid` as your current(!) folder with 2 subfolders A and C:

```
HW6_netid-----+
|             |
A-----+       C-----+
|     |       |     |
A.txt C.txt   A.txt B.txt
```

- both sub-folders happen to contain a file named `A.txt`
- `A.txt` is what I will call a __duplicate file__
- Your task is to find and report all such _duplicate_ filenames to the user.

In [5]:
from glob import glob # Import the glob function from the glob module

import os
# Here are the methods you may need to use from os and from os.path:
# os.sep, os.getcwd, os.listdir, os.chdir, os.walk
# os.path.exists, os.path.isdir, os.path.isfile, os.path.getsize, os.path.basename, os.path.join, os.path.split



###  Q1  Check that a folder is valid [3 pts ]
- write a function check_folder()
- you will later call `check_folder(folder)` with the name of a folder
- FYI although the folder will later be given by the user, for testing purposes we will just hard-code our input here

<p>

- you need to check that folder passes these tests:
    - does folder actually exist?
    - is folder actually a folder and not a file?
    - if folder is a folder does it contain any files? Otherwise, there can't be any duplicates in it anyway! 
     
- you will need to use functions (imported from os.path earlier) that themselves return True or False.
- If those return True you can return True as well
- If those return False, you must return an informative __error string__ instead!
- you must only return True or an error string, however, you can keep the debug print() inside the function that shows the folder content

<p>
    
- when you develop this function use the 4 tests below. 
- They test each of the four steps in the order suggested below, so you could write step 1 and then run test 1, etc.


In [6]:
def check_folder(folder):
    '''Will check if the folder (string) exists and also is not a file and also in not empty.
    Returns True if ALL three conditions are fulfilled, otherwise returns error straing
    '''

    # 1) check that the folder actually exists  (potential user typo?)

    # on error, return an informative error message (a single string!)
    if os.path.exists(folder) == False:
        return(folder, "IdontExist")
    
    # 2) check that folder is actually a folder, not just a weirdly named file 
    
    # on error, return an informative error message (a single string!)
    if os.path.isdir(folder) == False:
        return(folder, "ImAfileNotaFolder")
    
    # 3) check that folder actally contains any files or folders (i.e. is NOT empty) 
    
    # on error, return an informative error message (a single string!)
    if len(os.listdir(folder)) == 0:
        return(folder, "is empty")
    
    # 4) if no errors occured, return True
    return True
   
              
    

In [7]:
print("current folder:", os.getcwd())
print("content:", os.listdir())


current folder: c:\Users\david\OneDrive\Documents\HCI 574\HW6_dtang
content: ['A', 'B', 'C', 'D', 'E', 'HW6_dtang.ipynb', 'HW6_dtang.zip', 'ImAfileNotaFolder']


### Q2 Test battery for your check_folder() [2 pts]
- Don't change these tests, they must be passed as described (return True or an error string, as described)
- if something is wrong (test passes when it should fail, or vice versa) you need to fix your check_folder() code
- This assumes that the folder with your notebook file contains these folders  ```'A', 'B', 'C', 'D', 'E'``` and this file ```'ImAfileNotaFolder'```
- Each test is worth 0.25 pts

In [8]:
# 1) Does the folder exist?
folder = "IdontExist"
print("test1: Does folder exists? Must give an Error", folder, check_folder(folder))

test1: Does folder exists? Must give an Error IdontExist ('IdontExist', 'IdontExist')


In [9]:
# 2) test file vs folder
folder = "ImAfileNotaFolder"
print("test2: Is it a folder? Must give an Error", folder, check_folder(folder))

test2: Is it a folder? Must give an Error ImAfileNotaFolder ('ImAfileNotaFolder', 'ImAfileNotaFolder')


In [10]:
# 3) Is it a folder with files in it? (no, for empty folder E)
folder = "E"
os.mkdir(folder) # make folder, will be initially empty
print("test3: test if E is empty? E must give an Error", folder, check_folder(folder))

FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'E'

In [11]:
# 4) Is it a folder with files in it? (yes for all 4 folders)
print("test4: Are there actually any files in these folders i.e. is folder not empty?? Must be True for all these folders")

for folder in ["A", "B", "C", "D"]:
    print(folder, check_folder(folder))

test4: Are there actually any files in these folders i.e. is folder not empty?? Must be True for all these folders
A True
B True
C True
D True


### Q3 Define the `get_dups(folder1, folder2)` function [ 8 pts ]

- Define a function `get_dups(folder1, folder2)` that returns a list of duplicate names in 2 given subfolders
- Assuming you are "sitting" in HW6_netid, to get the duplicates for the subfolders A and B, `get_dups("A", "B")`  should return the list ['B.txt', 'D.txt']

<p>

Hints:
- this will be a pretty long function, so you may want to develop and test it inside a .py file in VS Code and use the full debugger!
- Or you could again use each of the test cases below
- folder1 and folder2 args are names of subfolder within the current folder  
- check that both folders are not the same (otherwise the rest would be pointless) otherwise return an error string instead of the list
- use your function check_folder() to check that each of the folder is OK, if not return an error string. Note that you cannot check for False (`if check_folder(folder1) == False:`) b/c the function will only return True or an error string! However, you can check for not True: `if check_folder(folder1) is not True:`

<p>
    
- Use `glob("<subfolder>/*.*")` to get all files in subfolder
- for A, this would be ["A/B.txt", "A/C.dat", "A/D.txt"] 
- make a list of strings containing the filenames of the files in each folder
    - for this you will need to use the os.path method split() to grab the second element (the filename), see below for a example of how this works
    - files_in_folder1 could contain ["A.txt", "C.txt"]
    - files_in_folder2 could contain ["A.txt", "B.txt"]
- the list duplicate_files is initially empty []
- go though files_in_folder1 (one string at a time) and test if it is the same as any of the strings in files_in_folder2 
    - remember that you can use the `in` statement to test if an element is in a list:
    - `"bla" in ["whatever", "stuff", "bla"]` => True, as there's a "bla" in that list
    - if you get a match, append it to duplicate_files
- at the end, the duplicate_files list here would contain ["A.txt"]

- You many need to use os.path.split() to split a path string into a folder name and a file name.
- This is needed so you can use `glob(<foldername>/*.*)` on both of your folders, then split off the filename (e.g. from A\B.txt, get B.txt) and compare it to the filenames from the other folder.


In [12]:
# This shows how os.path.split() works differently from the standard split()
paths = glob("A/*.*")
print(paths) # all files in A, but still with the folder path in from of it! e.g. A\B.txt

# split A\B.txt into folder name (A) and file name (B.txt)
path = paths[0] # first element in list created by glob()
split_tuple = os.path.split(path) # return a tuple with folder name at index 0 and file name at index 1
print(path, split_tuple[0], split_tuple[1] ) 

['A\\B.txt', 'A\\C.dat', 'A\\D.txt']
A\B.txt A B.txt


In [13]:
def get_dups(folder1, folder2):
    """returns a list of files that occur in both folders or an error string"""
    print("in get_dups(), folders are:", folder1, folder2)
    
    # check if folder1 and folder2 are the same, if so return an informative error message (a single string!)
    if folder1 == folder2:
        return(folder1, "and", folder2, "are the same folder")
    
    # check if folder1 is OK (got True), if not (didn't get True) return its error message
    if check_folder(folder1) != True:
        return(check_folder(folder1))

    # check if folder2 is OK (got True), if not (didn't get True) return its error message
    if check_folder(folder2) != True:
        return(check_folder(folder2))

    # We can assume both folders are OK
    duplicate_files = [] # this will later contain the names of files that occur in both folders (duplicates)

    # get files in folder1 in list1
    paths1 = glob(folder1+"/*.*")
    file_list1 = []
    for files1 in paths1:
        split_tuple1 = os.path.split(files1)
        file_list1.append(split_tuple1[1])

    # get files in folder2 in list2
    paths2 = glob(folder2+"/*.*")
    file_list2 = []
    for files2 in paths2:
        split_tuple2 = os.path.split(files2)
        file_list2.append(split_tuple2[1])

    # check if a file from list1 occurs in list2
    # if so, you found a duplicate, so append it to duplicate_files
    for file in file_list1:
        if file in file_list2:
            duplicate_files.append(file)
            file_list2.remove(file)

    
    # when you've done all the checks, return duplicate_file
    return duplicate_files
    

### Q4 Optional: in get_dups() also use the file size for determining duplicates. (2 pts)

- if there are duplicates (say B.txt) in the folders, they also must have the same size (in bytes) to be treated as true duplicates
- In this case, if one B.txt is 100 bytes but the other 120 bytes, they do __NOT__ count as true duplicates despite having the same file name
- for these "pseudo" duplicates (names match but sizes do not), print out something like `B.txt found in both folders but size differs (100 bytes vs 120 bytes)`
- change your code in the above get_dups() cell or copy it into the cell below and make your changes there

###  Q5 Battery of Test cases [ 2 pts]

For this last part you will simply show that your function works as expected. Run the cell below and see if the output matches this: (You are allowed to have some variation to this with your own error messages!)

```
get_dups("A", "B")  -> ['B.txt', 'D.txt'] 

get_dups("A", "C") -> ['B.txt', 'D.txt']

get_dups("A", "A") -> Error: A and A cannot be the same!

get_dups("A", "Stuff") -> Error: Stuff does not exist!

get_dups("A", "E") -> Error: E is empty!
```

with optional file size check you would instead get this for A C:

```
get_dups("A", "C") -> 
B.txt found in both folders but size differs (817 bytes vs 4179 bytes)
['D.txt']
```

In [14]:
# Test cases you must pass:
print("A", "B", get_dups("A", "B"))  # -> ['B.txt', 'D.txt'] 

in get_dups(), folders are: A B
A B ['B.txt', 'D.txt']


In [95]:
print("A", "C", get_dups("A", "C")) # -> ['B.txt', 'D.txt'] 
# or B.txt found in both folders but size differs (817 bytes vs 4179 bytes) ['D.txt']

in get_dups(), folders are: A C
A C ['B.txt', 'D.txt']


In [94]:
print("A", "A", get_dups("A", "A")) # -> Error: A and A cannot be the same!

in get_dups(), folders are: A A
A A ('A', 'and', 'A', 'are the same folder')


In [93]:
print("A", "Stuff", get_dups("A", "Stuff")) # -> Error: Stuff does not exist!

in get_dups(), folders are: A Stuff
A Stuff ('Stuff', 'IdontExist')


In [92]:
print("A", "E", get_dups("A", "E")) # Error: E is empty!

in get_dups(), folders are: A E
A E ('E', 'is empty')
