## Introduction
<p><img src="https://i.imgur.com/kjWF1So.jpg" alt="Different characters on a computer screen"></p>
<p>According to a 2019 <a href="https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/PasswordCheckup-HarrisPoll-InfographicFINAL.pdf">Google / Harris Poll</a>, 24% of Americans have used common passwords, like <code>abc123</code>, <code>Password</code>, and <code>Admin</code>. Even more concerning, 59% of Americans have incorporated personal information, such as their name or birthday, into their password. This makes it unsurprising that 4 in 10 Americans have had their personal information compromised online. Passwords built from commonly used phrases or personal information are drastically easier to crack.</p>
<p>You may have noticed over the years that password requirements have increased in complexity, including recommendations to change your passwords every couple of months. Below is a list of password requirements, compiled from industry recommendations, that you will be asked to test:</p>
<p><strong>Password Requirements:</strong></p>
<ol>
<li>Must be at least 10 characters in length</li>
<li>Must contain at least:<ul>
<li>one lower case letter </li>
<li>one upper case letter </li>
<li>one numeric character </li>
<li>one non-alphanumeric character</li></ul></li>
<li>Must not contain the phrase <code>password</code> (case insensitive)</li>
<li>Must not contain the user's first or last name, e.g., if the user's name is <code>John Smith</code>, then <code>SmItH876!</code> is not a valid password.</li>
</ol>
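<p>As a rough sketch, the four requirements above can be expressed as one plain-Python function. The name <code>is_valid_password</code> and its parameters are hypothetical and not part of the project. "Non-alphanumeric" is read here as any character outside <code>A-Z</code>, <code>a-z</code>, and <code>0-9</code>, which is an assumption.</p>

```python
import re


def is_valid_password(password: str, first_name: str, last_name: str) -> bool:
    """Hypothetical helper: True if `password` meets all four requirements."""
    lowered = password.lower()
    checks = [
        len(password) >= 10,                               # 1. minimum length
        re.search(r"[a-z]", password) is not None,         # 2. lower case letter
        re.search(r"[A-Z]", password) is not None,         # 2. upper case letter
        re.search(r"[0-9]", password) is not None,         # 2. numeric character
        re.search(r"[^A-Za-z0-9]", password) is not None,  # 2. non-alphanumeric (assumed ASCII reading)
        "password" not in lowered,                         # 3. no "password", case insensitive
        first_name.lower() not in lowered,                 # 4. no first name
        last_name.lower() not in lowered,                  # 4. no last name
    ]
    return all(checks)


print(is_valid_password("SmItH876!", "John", "Smith"))                  # False
print(is_valid_password("Mail_Pen%Scarlets.414", "consuelo", "eaton"))  # True
```
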
<p>Here is the dataset that you will investigate in this project:</p>
<div style="background-color: #ebf4f7; color: #595959; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/logins.csv</b></div>
Each row represents a login credential. There are no missing values and you can consider the dataset "clean".
<ul>
    <li><b>id:</b> the user's unique ID.</li>
    <li><b>username:</b> the username with the format {firstname}.{lastname}.</li>
    <li><b>password:</b> the password that may or may not meet the requirements. <i>Note: passwords should never be saved in plaintext; always store them as salted hashes when working with real, live passwords!</i></li>
</ul>
</div>
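<p>For stored credentials, avoiding plaintext usually means keeping a salted, deliberately slow hash rather than the password itself. The following is a minimal sketch using only the Python standard library. The function names and the iteration count are assumptions chosen for illustration, not a vetted security configuration.</p>

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor for illustration; tune for real systems


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Hypothetical helper: return (salt, digest) derived with PBKDF2-HMAC-SHA256."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Hypothetical helper: re-derive the digest and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("Mail_Pen%Scarlets.414")
print(verify_password("Mail_Pen%Scarlets.414", salt, stored))  # True
print(verify_password("wrong-guess", salt, stored))            # False
```
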
<p>Warning: This dataset contains some <strong>real</strong> passwords leaked from <strong>real</strong> websites. These passwords have been filtered, but may still include words that are explicit and offensive.</p>
<p>From here on out, it will be your task to explore and manipulate the existing data until you can answer the two questions described in the instructions panel. Feel free to import as many packages as you need to complete your task, and add cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><strong>Note:</strong> To complete this project, you need to know how to manipulate strings in pandas DataFrames and be familiar with regular expressions. Before starting this project we recommend that you have completed the following courses: <a href="https://learn.datacamp.com/courses/data-cleaning-in-python">Data Cleaning in Python</a> and <a href="https://learn.datacamp.com/courses/regular-expressions-in-python">Regular Expressions in Python</a>.</p>

In [98]:
import pandas as pd 

df = pd.read_csv('datasets/logins.csv') 

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 982 entries, 0 to 981
Data columns (total 3 columns):
id          982 non-null int64
username    982 non-null object
password    982 non-null object
dtypes: int64(1), object(2)
memory usage: 23.1+ KB
None


In [99]:
print(df.sample(n = 5))

      id         username        password
262  263      scot.golden      jSx7c4L3hq
943  944       andrea.day        akera007
274  275  clifford.steele  SjUQ09qKmXZGHR
221  222  stanley.workman          nickgd
492  493     stefan.irwin     QamzEqsdyvq


In [100]:
# True if the password contains at least 10 characters
len_check = df["password"].str.len() >= 10

# Keep only the passwords that are at least 10 characters long
df1 = df[len_check]
print(df1.head())
print(df1.shape)

   id          username               password
0   1    vance.jennings         vanceRules888!
1   2    consuelo.eaton  Mail_Pen%Scarlets.414
4   5    araceli.wilder             Araceli}r3
5   6  shawn.harrington            126_239_123
6   7        evelyn.gay           `4:&iAt$'o~(
(560, 3)


In [101]:
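# True if the password contains at least one lower case letter
lower_check = df1["password"].str.contains(r'[a-z]')
# True if the password contains at least one upper case letter
upper_check = df1["password"].str.contains(r'[A-Z]')
# True if the password contains at least one digit
num_check = df1["password"].str.contains(r'\d')
# True if the password contains at least one non-alphanumeric character.
# '[\w]' matches letters, digits, and '_', so it can never fail for a row
# that already passed the checks above; a negated class is needed instead.
# Note: the outputs shown in this notebook come from a run that used '[\w]' here.
idenet_check = df1["password"].str.contains(r'[^A-Za-z0-9]')
# Keep only the passwords that satisfy all four conditions
df2 = df1[lower_check & upper_check & num_check & idenet_check]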

print(df2.head())
print(df2.shape)

   id        username               password
0   1  vance.jennings         vanceRules888!
1   2  consuelo.eaton  Mail_Pen%Scarlets.414
4   5  araceli.wilder             Araceli}r3
6   7      evelyn.gay           `4:&iAt$'o~(
8   9     gladys.ward            =Wj1`i)xYYZ
(372, 3)
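
The character classes used in this step can be sanity-checked on a few strings with the standard `re` module, whose pattern syntax `str.contains` follows by default. The sample strings are copied from the outputs above. The `NON_ALNUM` pattern is an assumed reading of "non-alphanumeric" (anything outside ASCII letters and digits). Note that `\w` matches letters, digits, and the underscore, so on its own it does not single out special characters.

```python
import re

# Passwords copied from the outputs shown earlier in this notebook
samples = ["SjUQ09qKmXZGHR", "vanceRules888!", "126_239_123"]

WORD = re.compile(r"[\w]")               # letters, digits, underscore
NON_ALNUM = re.compile(r"[^A-Za-z0-9]")  # assumed definition of a "special" character

# Map each sample to (matches \w, matches non-alphanumeric)
results = {s: (bool(WORD.search(s)), bool(NON_ALNUM.search(s))) for s in samples}
for password, (has_word, has_special) in results.items():
    print(password, has_word, has_special)
```

Every sample matches `[\w]`, while only the last two contain a character outside letters and digits.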


In [102]:
# True if the password contains the phrase "password" (case insensitive)
pass_check = df2["password"].str.contains("password", case = False)
# Keep only the passwords that do not contain the phrase "password"
df3 = df2[~pass_check]

print(df3.head())
print(df3.shape)

   id        username               password
0   1  vance.jennings         vanceRules888!
1   2  consuelo.eaton  Mail_Pen%Scarlets.414
4   5  araceli.wilder             Araceli}r3
6   7      evelyn.gay           `4:&iAt$'o~(
8   9     gladys.ward            =Wj1`i)xYYZ
(371, 3)


In [103]:
# Check whether the password contains the user's first or last name.
# Usernames follow the {firstname}.{lastname} format, so split them on '.'
df3 = df3.copy()  # work on a copy so adding columns does not raise SettingWithCopyWarning
name_parts = df3["username"].str.split('.', expand = True)
df3['first_name'] = name_parts[0]
df3['last_name'] = name_parts[1]

print(df3.head())

   id        username               password first_name last_name
0   1  vance.jennings         vanceRules888!      vance  jennings
1   2  consuelo.eaton  Mail_Pen%Scarlets.414   consuelo     eaton
4   5  araceli.wilder             Araceli}r3    araceli    wilder
6   7      evelyn.gay           `4:&iAt$'o~(     evelyn       gay
8   9     gladys.ward            =Wj1`i)xYYZ     gladys      ward


In [104]:
# user_check is True when the password contains neither the first nor the last name
user_check_list = []
for i, row in df3.iterrows():
    password_lower = row.password.lower()
    if row.first_name in password_lower or row.last_name in password_lower:
        user_check_list.append(False)  # a name was found, so requirement 4 fails
    else:
        user_check_list.append(True)
        
user_check = pd.Series(user_check_list,index=df3.index ,name = 'password') 

df4 = df3[user_check]
print(df4.head())
print(df4.shape)

    id        username                    password first_name last_name
1    2  consuelo.eaton       Mail_Pen%Scarlets.414   consuelo     eaton
6    7      evelyn.gay                `4:&iAt$'o~(     evelyn       gay
8    9     gladys.ward                 =Wj1`i)xYYZ     gladys      ward
13  14   jamie.cochran  Deviants.Assists.Impede+24      jamie   cochran
15  16      lorrie.gay                Q0G:[@u9*_`_     lorrie       gay
(358, 5)
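
The row-by-row loop above can also be written without `iterrows`. The sketch below is one possible alternative, not the project's reference solution: it zips the name columns with the lower-cased passwords. The three rows are copied from the outputs above, and the variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the outputs above, in the same shape as logins.csv
sample = pd.DataFrame({
    "username": ["vance.jennings", "consuelo.eaton", "araceli.wilder"],
    "password": ["vanceRules888!", "Mail_Pen%Scarlets.414", "Araceli}r3"],
})

names = sample["username"].str.split(".", expand=True)  # column 0: first name, column 1: last name
lowered = sample["password"].str.lower()

# True where the lower-cased password contains the first or the last name
has_name = pd.Series(
    [first in pw or last in pw for first, last, pw in zip(names[0], names[1], lowered)],
    index=sample.index,
)

print(sample[~has_name])  # only the consuelo.eaton row remains
```
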


In [105]:
# One minus the share of valid passwords gives the share of weak passwords
bad_pass = round(1 - df4.shape[0] / df.shape[0], 2)
print(bad_pass)

0.64


In [106]:
# Every user in the full dataset who is not among the valid passwords in df4
# has a weak password; collect those usernames
weak_mask = ~df["username"].isin(df4["username"])

email_list = df.loc[weak_mask, "username"]

print(email_list.head())
print(email_list.shape)

0      vance.jennings
2     mitchel.perkins
3      odessa.vaughan
4      araceli.wilder
5    shawn.harrington
Name: username, dtype: object
(624,)
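
Both answers can also be derived from a single boolean mask instead of a chain of filtered DataFrames. The following is a minimal sketch under stated assumptions: the four rows are copied from outputs above, the names `valid`, `bad_share`, and `weak_users` are illustrative, and the special-character pattern assumes "non-alphanumeric" means anything outside ASCII letters and digits.

```python
import pandas as pd

# Four rows copied from the outputs above, in the same shape as logins.csv
sample = pd.DataFrame({
    "username": ["vance.jennings", "consuelo.eaton", "andrea.day", "evelyn.gay"],
    "password": ["vanceRules888!", "Mail_Pen%Scarlets.414", "akera007", "`4:&iAt$'o~("],
})

pw = sample["password"]
lowered = pw.str.lower()
names = sample["username"].str.split(".", expand=True)

# True where the password contains neither the first nor the last name
no_name = pd.Series(
    [f not in p and l not in p for f, l, p in zip(names[0], names[1], lowered)],
    index=sample.index,
)

# One mask that is True only when every requirement holds
valid = (
    (pw.str.len() >= 10)
    & pw.str.contains(r"[a-z]")
    & pw.str.contains(r"[A-Z]")
    & pw.str.contains(r"[0-9]")
    & pw.str.contains(r"[^A-Za-z0-9]")  # assumed reading of "non-alphanumeric"
    & ~lowered.str.contains("password", regex=False)
    & no_name
)

bad_share = round(1 - valid.mean(), 2)        # share of passwords failing any check
weak_users = sample.loc[~valid, "username"]   # usernames with weak passwords

print(bad_share)
print(weak_users.tolist())
```

Keeping one mask aligned with the original index means the share and the username list come from the same source and cannot drift apart.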
