Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invoke-WebRequest should save the file name with the name in Content-Disposition #11671

Open
he852100 opened this issue Jan 24, 2020 · 30 comments
Labels
Issue-Enhancement the issue is more of a feature request than a bug KeepOpen The bot will ignore these and not auto-close WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module WG-NeedsReview Needs a review by the labeled Working Group
Milestone

Comments

@he852100
Copy link

he852100 commented Jan 24, 2020

How to get a file name from the server when downloading a file?

The server has a Content-Disposition property, how to get it, and is used when downloading files

Steps to reproduce

$URL=@'
https://skydrive.live.com/download?resid=A78ACCAEBB24EDD7!37948&authkey=!AKjXpDoxgONjhUw
'@

Iwr $URL -outfile ./
Or
$a=Iwr $URL -outfile ./sss

Expected behavior

Autoaddfilename
Or
PS >$a.headers
  Key                 Value
---                 -----
x-amz-id-2          {XAI956ZLVN5q12SAfXINuiiU6rUrLD0w88mtCcMt1uLcwwv…
x-amz-request-id    {3EE4C53FE4416D0E}
Date                {Fri, 24 Jan 2020 08:36:04 GMT}
ETag                {"c4ac8566c3831011ab981c9662b33183"}
Accept-Ranges       {bytes}
Server              {AmazonS3}
Last-Modified       {Wed, 18 Dec 2019 08:27:16 GMT}
Content-Disposition {attachment; filename=OpenSSH-Win32.zip}
Content-Type        {application/octet-stream}
Content-Length      {3018600}

Actual behavior

Error or null

Environment data


@he852100 he852100 added the Issue-Question ideally support can be provided via other mechanisms, but sometimes folks do open an issue to get a label Jan 24, 2020
@he852100
Copy link
Author

You must now specify it in a variable. When downloading large files will become difficult

@iSazonov
Copy link
Collaborator

If I understand correctly you want to set only target directory and use file name from Content-Disposition.
What behavior (in a script, in interactive session) should be if this header is not present?

@iSazonov iSazonov added the WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module label Jan 24, 2020
@mklement0
Copy link
Contributor

mklement0 commented Jan 24, 2020

I agree that the following would make perfect sense and constitutes a simple enhancement to Invoke-WebRequest :

  • Allow passing a mere directory path to Invoke-WebRequest's -OutFile parameter; e.g., . to ask the downloaded file to be placed in the current directory.

  • If the -OutFile argument is indeed a directory path, look for the server-side file name in the Content-Disposition response-header field, and use that as the local file name for the downloaded file; if no such header field is present or if it contains an invalid file name, report a statement-terminating error.

This has been asked for twice before - #6618 and #6537 - but shot down on what appears to me to be flawed reasoning: this is a simple enhancement to an extant feature that makes its use more convenient, not an attempt at "feature creep".

@SteveL-MSFT, given that and that this is now being asked for for the third time - can we revisit this?

@mklement0
Copy link
Contributor

mklement0 commented Jan 24, 2020

P.S.: Given that the file name would then be controlled by the server of origin, there's an even more pressing need to implement #3174 and to use -LiteralPath rather than -OutFile.

@he852100
Copy link
Author

Need to get 'Content-Disposition' approach, there is currently no method.To get it you must store "iwr" as a variable. It will take up a lot of memory.

Through links, sometimes it's hard to get a file name,

@iSazonov
Copy link
Collaborator

There is a reference to RFC Download cmdlet PowerShell/PowerShell-RFC#124

@mklement0
Copy link
Contributor

@he852100 , yes, the current workaround is both inconvenient and inefficient (not sure if how the header field is parsed below is robust enough):

$url = 'https://github.com/PowerShell/Win32-OpenSSH/releases/download/v8.1.0.0p1-Beta/OpenSSH-Win32.zip'
$response = iwr $url  # stores the full file content in memory
$filename = $response.Headers.'Content-Disposition' -replace '.*\bfilename=(.+)(?: |$)', '$1'
$outDir = Convert-Path $pwd
[IO.File]::WriteAllBytes("$outDir/$filename", $response.Content)

Yes, @iSazonov, but that is a separate matter - a dedicated downloading cmdlet with all the bells and whistles is also nice to have.

As stated, this is simply about making an existing feature more convenient - arguably, it should have worked that way from the beginning.

@iSazonov
Copy link
Collaborator

@mklement0 The RFC comes from a fact that web cmdlets are already very complex - they have over 30 parameters! And now we discuss that -OutFile will have additional semantics. This not only opens the way to new errors (how to distinguish path and file? What is user skip final slash? It is very bad UX for script language.) and typos, but it also confuses users by complicity.

@mklement0
Copy link
Contributor

they have over 30 parameters!

So it's a good thing that this enhancement doesn't require a new parameter.

how to distinguish path and file? What is user skip final slash?

The semantics would be the exact same as with Copy-Item with a file source path, for instance: if the target path refers to an existing directory, copy to that directory - whether or not there is a trailing \ or /; if the target path doesn't exist, treat it as file.

No additional complexity, just long-established and well-known patterns.

@he852100
Copy link
Author

@iSazonov Can I transfer headers to a variable when I have the OutFile parameter?

You just need to provide this feature,The user determines the processing method by the variable value.

$a=iwr f outfile ./dff.vvg
$a=$a.headers xxx
rename-item xxx

@iSazonov
Copy link
Collaborator

I should again notice that web cmdlets is very complex. We should avoid to add more complicity. For file processing scenarios we have @markekraus 's RFC.

@mklement0
Copy link
Contributor

As noted, enhancing Invoke-WebRequest as proposed and the RFC aren't mutually exclusive.
The proposed feature adds little complexity, from what I can tell, and only makes the download feature that is already there and won't go away complete.

@mklement0
Copy link
Contributor

@he852100: The existing -PassThru switch provides similar functionality in conjunction with -OutFile, but it passes the entire response through - including the whole file content in property .Content.

I don't see a need for a new parameter that passes just the header through - the proposed enhancement (arguably: fix) requires no new parameters and is the most straightforward solution to the problem.

@SteveL-MSFT
Copy link
Member

Agree that the proposed enhancement and the RFC are not mutually exclusive and downloading the whole file just to get the filename is a pretty bad experience. Overloading -OutFile also seems prone to cause problems. A new -OutPath may make more sense.

@iSazonov iSazonov added Issue-Enhancement the issue is more of a feature request than a bug and removed Issue-Question ideally support can be provided via other mechanisms, but sometimes folks do open an issue to get a labels Jan 25, 2020
@mklement0
Copy link
Contributor

mklement0 commented Jan 25, 2020

Thanks, @SteveL-MSFT.

Adding a new parameter is still preferable to not implementing this enhancement at all, but please consider this regarding overloading -OutFile:

  • While the word "file" in -OutFile may be confusing, that issue will go away with implementing the already green-lighted fix in Discussion: Use literal paths for OutFile parameter on Invoke-RestMethod / Invoke-WebRequest #3174 - which should be implemented either way - because you would then use -LiteralPath, and "path" can refer to either files or directories.

  • With the name issue out of the way, I don't think the proposed behavior will cause problems, given how well-established the behavior is in the target-path semantics of commands such Copy-Item, cp, xcopy, copy:

    • If you specify an existing directory, the file will be placed by its server-designated name there.
    • If you specify a file path, that path is saved to as-is; a file path is either:
      • the path to an existing file (which is then quietly replaced, as before)
      • the path to an existing directory followed by the name of a new file to be created

@SteveL-MSFT
Copy link
Member

@mklement0 I would support someone implementing it as an experimental feature and so we can get real world usage to validate the experience :)

@mklement0
Copy link
Contributor

@CarloToso has graciously tackled an implementation, in the following PR:

New information has since come to light, which requires further discussion:

  • While Content-Disposition is worth supporting - while it not may be all-too common, it is used in practice, notably on GitHub (for instance, the Content-Disposition file name for https://github.com/mklement0/ttab/archive/stable.zip is ttab-stable.zip, which is more useful than using just the last URI segment, stable.zip) - but perhaps better on an opt-in basis, which is what curl and wget offer, given the following warning from curl's man page:

    • WARNING: Exercise judicious use of this option, especially on
      Windows. A rogue server could send you the name of a DLL or
      other file that could be loaded automatically by Windows or some
      third party software.

  • At this point, the implementation in the PR (exclusively) uses the ultimately redirected-to target URI's last segment as the file name.

    • This is defensible, but differs from how curl handles this with its -O parameter, in that it uses the input URI's last segment (while also requiring -L in order to follow redirections).

    • If we stick with the current approach, there's a discoverability problem, which would also apply to a Content-Disposition opt-in: how will the user learn what file name was ultimately used, short of examining the output directory for the most recently added file (which is cumbersome and not fully robust).


My suggestion is as follows:

  • Provide a Content-Disposition opt-in via a new switch, say -UseSuggestedFileName, which would have to report an error if the -OutFile argument targets a file.

    • I know there are concerns over having too many parameters, but I don't seen another option.
  • As for the discoverability problem:

    • At the very least, -Verbose should include the file name ultimately used.
    • -PassThru should use an ETS instance member to decorate the passed-through Microsoft.PowerShell.Commands.BasicHtmlWebResponseObject instance with a .FullName property that reflects the output file's full path.
      • This would preserve backward-compatibility, but may still be obscure.
      • Ideally, a [System.IO.FileInfo] instance representing the output file would be returned if -PassThru is combined with -OutFile; while this amounts to a breaking change, it strikes me as the more sensibly behavior, and perhaps it falls into bucket 3.
  • As for how to parse Content-Disposition and fallback behavior:

    • Based on the description of the Content-Disposition header field, the C# equivalent of the following is probably good enough in practice, though the recommendation is to give precedence to filename* (sic) entries:

      • <value-of-Content-Disposition-field(s)> -replace '^.*\bfilename\*?="?([^;"]+)?.*', '$1'
      • If there's no Content-Disposition field or if the file-name extraction fails, fall back to the last-URI-segment approach

@iSazonov
Copy link
Collaborator

  • fall back to the last-URI-segment approach

Since this is already implemented in #19007 we can merge the PR. It seems it works well without experimental feature.
Then we could consider the Content-Disposition proposal.

@mklement0
Copy link
Contributor

mklement0 commented Feb 27, 2023

Fair enough - the Content-Disposition opt-in can be tackled later, but, given the discoverability problem I mentioned, we should at least decide on - and then implement as part of #19007 - how to pass the output file name through with -PassThru (and to also mention it in -Verbose output).

@CarloToso
Copy link
Contributor

CarloToso commented Mar 15, 2023

@mklement0 As we wait for #19007 to be merged this is the current state in local:

internal static string GetOutFilePath(HttpResponseMessage response, string _qualifiedOutFile)
{
    if (Directory.Exists(_qualifiedOutFile))
    {
        string contentDisposition = response.Content.Headers.ContentDisposition?.FileNameStar ?? response.Content.Headers.ContentDisposition?.FileName;
        string pathAndQuery = response.RequestMessage.RequestUri.PathAndQuery;

        if (!string.IsNullOrEmpty(contentDisposition)) 
        {
            // Get file name from Content-Disposition header if present
            return Path.Join(_qualifiedOutFile, contentDisposition);
        }
        else if (pathAndQuery != "/")
        {
            string lastUriSegment = System.Net.WebUtility.UrlDecode(response.RequestMessage.RequestUri.Segments[^1]);
            string requestUri = "CURL_URI_STRING"; // We get the last segment, must be sanitized

            // Should I use ValueTuple<bool, bool> or (bool, bool) ?
            ValueTuple<bool, bool> test = (lastUriSegment.Contains('.'), requestUri.Contains('.'));
            
            // Is this easily understandable?
            string result = test switch
            {
                (true, _) or
                (false, false) => lastUriSegment,
                (false, true) => requestUri,
            };
            //I need better names for test and result

            // Get file name from last segment of Uri
            return Path.Join(_qualifiedOutFile, result);
        }
        else
        { 
            //Warning here?

            // File name not found use sanitized Host name instead
            return Path.Join(_qualifiedOutFile, response.RequestMessage.RequestUri.Host.Replace('.', '_'));
        }
    }
    else
    {
        return _qualifiedOutFile;
    }
}

Could you please review?

@mklement0
Copy link
Contributor

Thank you, @CarloToso.

I'll leave the question about the specific of the C# code to others (it looks understandable to me), by my thoughts on the logic are:

  • The current PR will give use last-response-URI-segment logic, i.e the last segment of the ultimately-redirected-to URI, (only) for now (which, without later also enhancing -PassThru to reveal the target filename, presents a discovery problem for the user).

  • What your proposed code does is to roll the following behaviors into one, situationally deciding which behavior to apply automatically:

    • Use a Content-Disposition filename if available.
    • Otherwise, if the path component of the ultimately-redirected-to URI isn't empty or just /:
      • If the last path segment of the ultimately-redirected-to URI contains ., use it.
      • If it doesn't, but the last path segment of the input URI doesn't contain . either, use it.
      • Otherwise (only the last path segment of the input URI contains .), use the last path segment of the input URI.
    • Otherwise (there's only a hostname in the ultimately-redirected-to URI, e.g. http://example.org), use a sanitized version of the hostname (e.g. example_org)

This presents two challenges:

  • It makes use of the Content-Disposition-suggested filename without an explicit opt-in from the user

    • Elsewhere you suggested emitting a warning in that case, given the previously discussed potential security issues around letting the server decide the filename.
    • However, a deliberate opt-in strikes me as the better solution (the warning could then be relegated to the documentation, the way curl does it) - even though that means we need a new switch, say -AutoFileName (which could even be used in isolation, implying -OutFile .)
  • It may make it too hard for users to understand and remember these rules; I feel that simpler logic is called for:

    • Invariably default to the last segment of the input URI (like curl does):

      • This makes it obvious to the user what the target filename will be - no need for discovery via -PassThru

        • Unfortunately, this is at odds with the current PR, but perhaps we can still change course there.
      • Error out if there is no (usable) path segment in the input UI (that is, don't fall back to the hostname).

    • Honor Content-Disposition only with the opt-in switch, and allow discovery of the server-chosen filename via -PassThru

      • If there is no (valid) Content-Disposition header field, fall back to the default behavior.
    • Note that this takes use of the last path segment of the ultimately-redirected-to URI completely out of the picture, but I think that's acceptable.

@CarloToso
Copy link
Contributor

@mklement0 Thank you for your thorough review.

  • In my opinion the usage of this new feature (let's call it autonaming) intrinsecally exposes the user to a some risk wether we use the last segment or the content-disposition. The last segment of the original uri is safer but often less informative and so less useful.
  • I wouldn't use an opt in, too many parameters and discoverability problem.
  • I don't think the users need to remember these rules, they are delegating the choice of the name to PowerShell. If they wanted to control the output they would have added a name string to outfile
  • Erroring out would be a breaking change, I would prefer to avoid it (we could look for a better naming scheme)

I will wait for @SteveL-MSFT to decide the best course of action before continuing.

@mklement0
Copy link
Contributor

mklement0 commented Mar 17, 2023

Thanks, @CarloToso.

intrinsically exposes the user to a some risk

I do see your point, but with the last-input-URI-segment logic that risk is known, unlike with the other two options.

too many parameters and discoverability problem.

I do see the parameter-proliferation problem; but if opt-in is desirable (TBD), there's no other way.
Discoverability could be helped by pointing to -AutoFileName from the description of -OutFile in the documentation.
And, as stated, you should be able to use only -AutoFileName, as shorthand for -AutoFileName -Out-File .

An opt-in would also allow us to report the then-unknown resulting filename by default, instead of the user needing to remember to ask for it via -PassThru.

Erroring out would be a breaking change

Since nothing has been released yet that would allow you to pass a mere directory path to -OutFile, I don't think it would be a breaking change - or am I missing something?

That said, with -AutoFileName it would make sense to fall back to the sanitized hostname, and considering the last segment of the ultimately-redirected-to-URI makes sense then as well.
An error would then only be reported if you don't use -AutoFileName and no suitable last-input-URI-segment is present.

@CarloToso
Copy link
Contributor

@SteveL-MSFT friendly ping

@iSazonov
Copy link
Collaborator

I don't see a problem if we try Content-Disposition first and then make a fallback to last segment. If an user is not satisfied with this he can always use the explicit name as he does now.

@majkinetor
Copy link

Oh, finally, thank you for this @CarloToso 🥇

@MatejKafka
Copy link
Contributor

@CarloToso The provided code looks more-or-less the same as what Chromium is doing, sans some sanitization: https://github.com/chromium/chromium/blob/11a147047d7ed9d429b53e536fe1ead54fad5053/net/base/filename_util_internal.cc#L219 In general, it would make sense to me to stay close to the behavior of major browsers as much as possible, since that's what most users likely expect as the default.

@MatejKafka
Copy link
Contributor

MatejKafka commented Aug 22, 2023

@CarloToso Note that both the Content-Disposition filename value and the URL decoded last segment may contain invalid chars (especially / and \, which could lead to nasty path traversal issues, and also the usual set of forbidden characters (Path.GetInvalidFileNameChars())) and should also be checked for DOS special files (COM0-9, AUX,...).

The linked Chromium function has quite a lot of various checks, I'm trying to port the relevant parts to C# now (for an unrelated project), I'll post the final code if I get something usable.

@MatejKafka
Copy link
Contributor

MatejKafka commented Aug 23, 2023

This is what I have now (Windows-only, although only minor modifications should be needed to make it work on Linux), seems to work reasonably intuitively from my testing:

Source code
using System;
using System.IO;
using System.Linq;
using System.Net.Http.Headers;
using System.Text.RegularExpressions;
using System.Web;

namespace Pog.Utils.Http;

/// <summary>
/// Utility class for getting the server-provided file name for a downloaded file.
/// </summary>
public static class HttpFileNameParser {
    /// <param name="resolvedUri">Resolved URI (after redirects) of the downloaded file.</param>
    /// <param name="contentDisposition">The Content-Disposition header from the server response, if any.</param>
    public static string GetDownloadedFileName(Uri resolvedUri, ContentDispositionHeaderValue? contentDisposition) {
        // link to a similar function in Chromium:
        // https://github.com/chromium/chromium/blob/11a147047d7ed9d429b53e536fe1ead54fad5053/net/base/filename_util_internal.cc#L219
        // Chromium also takes into account the mime type of the response, we're not doing that for now

        // try available names in the order of preference, use the first one found that is valid
        string? fileName = null;

        // if Content-Disposition is set and valid, use the specified filename
        fileName ??= SanitizeDownloadedFileName(GetRawDispositionFileName(contentDisposition));

        // use the last segment of the resolved URL
        var lastSegment = resolvedUri.Segments.LastOrDefault();
        fileName ??= SanitizeDownloadedFileName(HttpUtility.UrlDecode(lastSegment));

        // use the hostname
        fileName ??= SanitizeDownloadedFileName(resolvedUri.Host);

        // fallback default name
        fileName ??= "download";

        return fileName;
    }

    private static string? GetRawDispositionFileName(ContentDispositionHeaderValue? contentDisposition) {
        if (contentDisposition is not {DispositionType: "attachment"}) {
            return null;
        }

        var headerVal = contentDisposition switch {
            {FileNameStar: not null} => contentDisposition.FileNameStar,
            {FileName: not null} => contentDisposition.FileName,
            _ => null,
        };

        if (headerVal == null) {
            return null;
        }

        if (headerVal.StartsWith("\"") && headerVal.EndsWith("\"")) {
            // ContentDispositionHeaderValue parser leaves the quotes in the parsed filename, strip them
            headerVal = headerVal.Substring(1, headerVal.Length - 2);
        }

        return headerVal;
    }

    private static readonly Regex InvalidDosNameRegex = new Regex(@"^(CON|PRN|AUX|NUL|COM\d|LPT\d)(\..+)?$",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    private static string? SanitizeDownloadedFileName(string? fileName) {
        // sources for relevant functions in Chromium:
        // GetFileNameFromURL: https://github.com/chromium/chromium/blob/11a147047d7ed9d429b53e536fe1ead54fad5053/net/base/filename_util_internal.cc#L119
        // GenerateSafeFileName: https://github.com/chromium/chromium/blob/bf9e98c98e8d7e79befeb057fde42b0e320d9b19/net/base/filename_util.cc#L163
        // SanitizeGeneratedFileName: https://github.com/chromium/chromium/blob/11a147047d7ed9d429b53e536fe1ead54fad5053/net/base/filename_util_internal.cc#L79

        // list of invalid filenames on Windows: https://stackoverflow.com/a/62888

        if (fileName == null) {
            return null;
        }

        // Win32 does not like trailing '.' and ' ', remove it
        fileName = fileName.TrimEnd('.', ' ');
        // replace any invalid characters with _
        fileName = string.Join("_", fileName.Split(Path.GetInvalidFileNameChars()));

        if (InvalidDosNameRegex.IsMatch(fileName)) {
            // is a DOS file name, prefix with _ to lose the special meaning
            fileName = "_" + fileName;
        }

        // if fileName is empty or only consists of invalid chars (or _), skip it
        if (fileName.All(c => c == '_')) {
            return null;
        }
        return fileName;
    }
}
Unit tests
using System.Net.Http.Headers;
using Pog.Utils.Http;
using Xunit;

namespace Pog.Tests;

// list of test cases for Content-Disposition parsing: http://test.greenbytes.de/tech/tc2231/
public class HttpFileNameParserTests {
    [Fact]
    public void TestDosNames() {
        TestSingle("_aux", "https://example.com/aux");
        TestSingle("_AUX.txt", "https://example.com/AUX.txt");
        TestSingle("_COM9", "https://example.com/COM9");
    }

    [Fact]
    public void TestPriority() {
        TestSingle("header", "https://host/segment", "attachment; filename=header");
        TestSingle("segment", "https://host/segment");
        TestSingle("segment", "https://host/segment", "attachment; filename=_");
        TestSingle("host", "https://host/");
        TestSingle("host", "https://host/_", "attachment; filename=_");
    }

    [Fact]
    public void TestSanitization() {
        // filename*
        TestSingle("hello_world.txt", "https://host/segment", "attachment; filename*=utf-8''hello%2Fworld.txt");
        // filename
        TestSingle("hello_world.txt", "https://host/segment", "attachment; filename=\"hello/world.txt\"");
        // segment
        TestSingle(".._hello_world.txt", "https://host/..%2Fhello%2Fworld.txt");
        TestSingle("invalid_chars_.txt", "https://host/invalid%2fchars%0a.txt");
    }

    private static void TestSingle(string expected, string url, string? dispositionHeader = null) {
        var disposition = dispositionHeader == null ? null : ContentDispositionHeaderValue.Parse(dispositionHeader);
        Assert.Equal(expected, HttpFileNameParser.GetDownloadedFileName(new Uri(url), disposition));
    }
}

@ziesemer
Copy link

This remains a valid issue that should be addressed.

@microsoft-github-policy-service microsoft-github-policy-service bot removed the Resolution-No Activity Issue has had no activity for 6 months or more label Feb 19, 2024
@SteveL-MSFT SteveL-MSFT added KeepOpen The bot will ignore these and not auto-close WG-NeedsReview Needs a review by the labeled Working Group labels Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue-Enhancement the issue is more of a feature request than a bug KeepOpen The bot will ignore these and not auto-close WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module WG-NeedsReview Needs a review by the labeled Working Group
Projects
None yet
Development

No branches or pull requests

9 participants